A MAXIMUM LIKELIHOOD METHOD FOR THE INCIDENTAL 
PARAMETER PROBLEM 

By MARCELO J. MOREIRA 

Columbia University and FGV/EPGE 

This paper uses the invariance principle to solve the incidental 
parameter problem of [Econometrica 16 (1948) 1-32]. We seek group 
actions that preserve the structural parameter and yield a maximal 
invariant in the parameter space with fixed dimension. M-estimation 
from the likelihood of the maximal invariant statistic yields the maxi- 
mum invariant likelihood estimator (MILE). Consistency of MILE for 
cases in which the likelihood of the maximal invariant is the prod- 
uct of marginal likelihoods is straightforward. We illustrate this re- 
sult with a stationary autoregressive model with fixed effects and an 
agent-specific monotonic transformation model. 

Asymptotic properties of MILE, when the likelihood of the max- 
imal invariant does not factorize, remain an open question. We are 
able to provide consistency, asymptotic normality and efficiency results 
for MILE when invariance yields Wishart distributions. Two examples 
are an instrumental variable (IV) model and a dynamic panel data 
model with fixed effects. 

1. Introduction. The maximum likelihood estimator (MLE) is a proce- 
dure commonly used to estimate a parameter in stochastic models. Under 
regularity conditions, the MLE is not only consistent but also asymptotically 
optimal (e.g., [26]). In the presence of incidental parameters, however, the 
MLE of structural parameters may not be consistent. This failure occurs be- 
cause the dimension of incidental parameters increases with the sample size, 
affecting the ability of MLE to consistently estimate the structural param- 
eters. This is the so-called incidental parameter problem after the seminal 
paper by [35]. 

This paper appeals to the invariance principle to solve the incidental pa- 
rameter problem. We propose to find a group action that preserves the model 
and the structural parameter. This yields a maximal invariant statistic. Its 
distribution depends on the parameters only through the maximal invariant 
in the parameter space. Maximization of the invariant likelihood yields the 
maximum invariant likelihood estimator (MILE). Distinct group actions in 
general yield different estimators. We seek group actions whose maximal in- 
variant in the parameter space has fixed dimension regardless of the sample 
size. 

The use of invariance to eliminate nuisance parameters has a long history 
(e.g., [9]). However, the use of invariance to solve the incidental parameter 
problem is limited to only a few models (e.g., see [29] for estimating variance 
components using invariance to the mean). There has also been some 
discussion on identifiability by [28] for additional groups of transformations. 
However, asymptotic properties of MILE are hardly addressed in the lit- 
erature. The difficulty in obtaining asymptotic results arises because the 
likelihood of the maximal invariant is often not the product of marginal 
likelihoods. 

An important methodological question is whether the use of invariance 
yields consistency and optimality in models whose number of parameters 
increases with the sample size. As is customary in the literature, we illustrate 
these results with a series of examples. 

To establish a context, Section 3 considers two groups of transformations 
whose use of invariance completely discards the incidental parameters. In 
both examples, the likelihood of the maximal invariant is the product of 
marginal likelihoods; consistency, asymptotic normality, and efficiency of 
MILE are straightforward. The first example is the stationary autoregressive 
model with fixed effects. For a particular group action, the solution 
coincides with the conditional likelihood approach of [4] and the integrated 
likelihood approaches of [15] and [25]. The second example is the monotonic transformation model. The 
proposed transformation is agent-specific and has infinite dimension. The 
conditional and integrated likelihood approaches do not seem to be appli- 
cable here. The invariance principle provides an estimator that is consistent 
and asymptotically normal under the assumption of normal errors. 

We then proceed to the two main examples of the paper. For both ex- 
amples, invariance arguments yield Wishart distributions. Standardization 
of the likelihoods yields consistency, asymptotic normality, and optimality 
results for MILE. Although our theoretical findings are somewhat specific 
to Wishart distributions, we hope that interesting general lessons can be 
learned from studying those particular likelihoods. 

Section 4 considers an instrumental variable (IV) model with N obser- 
vations and K instruments. For the orthogonal group of transformations, 
MILE coincides with the LIMLK estimator. The asymptotic theory for the 
invariant likelihood unifies theoretical findings for LIMLK under both the 
strong instruments (SIV) and many weak instruments (MWIV) asymptotics 
(e.g., [10, 22] and [31]). This framework parallels standard M-estimation in 
problems in which the number of parameters does not change with the sam- 
ple size. In particular, we are able to (i) show consistency of the MLE in the 
IV setup even under MWIV asymptotics from the perspective of likelihood 
maximization; (ii) derive the asymptotic distribution of the MLE directly 
from the objective function under SIV and MWIV asymptotics; and (iii) 
provide an explanation for optimality of MLE within the class of regular 
invariant estimators. 

Section 5 presents a simple dynamic panel data model with N individuals 
and T time periods. We propose to use MILE based on the orthogonal 
group of transformations. This estimator is novel in the dynamic panel data 
literature and presents a number of desirable properties. It is consistent, as 
long as NT goes to infinity (regardless of the relative rate of N and T) and 
asymptotically normal under (i) large N, fixed T; and (ii) large N, large 
T asymptotics when the autoregressive parameter is smaller than one. We 
derive an efficiency bound for large N, fixed T asymptotics when errors are 
normal; our bound coincides with the bound of [17] when $T \to \infty$. MILE reaches 
(i) our bound when N is large and T is fixed; and (ii) the bound of [17] when 
both N and T are large. The bias-corrected ordinary least squares (BCOLS) 
estimator (e.g., [17]) only reaches the second bound. As a result, it is shown 
that MILE asymptotically dominates the BCOLS estimator. Finally, [13] 
use invariance to show that the correlated random effects estimator has a 
minimax property. The fixed effects estimator MILE also has a minimax 
property for the group of transformations considered here. 

Section 6 compares MILE with existing fixed-effects estimators for the 
dynamic panel data model. 

Section 7 concludes. The Appendix provides proofs for our results. 

2. The maximum invariant likelihood estimator. In this section, we re- 
visit the basic concepts of invariance (e.g., [16]) and their use to eliminate 
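nuisance parameters. Let $P_{\gamma,\eta}$ denote the distribution of the data set $Y \in \mathcal{Y}$ 
when the structural parameter is $\gamma \in \Gamma$ and the incidental parameter is 
$\eta \in \mathcal{N}$: $\mathcal{L}(Y) = P_{\gamma,\eta} \in \mathcal{P}$. 

We seek a group $G$ and actions $A_1(\cdot, Y)$ and $A_2(\cdot, (\gamma,\eta))$ in the sample 
and parameter spaces that preserve the model $\mathcal{P}$: 

$$\mathcal{L}(Y) = P_{\gamma,\eta} \;\Rightarrow\; \mathcal{L}(A_1(g,Y)) = P_{A_2(g,(\gamma,\eta))} \quad \text{for any } P_{\gamma,\eta} \in \mathcal{P}.$$

We are interested in $\gamma$. This yields the following definition. 

Definition 2.1. Suppose that $A_2 : G \times \Gamma \times \mathcal{N} \to \Gamma \times \mathcal{N}$ induces an 
action $A_3 : G \times \mathcal{N} \to \mathcal{N}$ such that 

$$A_2(g, (\gamma,\eta)) = (\gamma, A_3(g,\eta)).$$

Then the parameter $\gamma$ is said to be preserved. The incidental parameter 
space $\mathcal{N}$ is preserved if 

$$\mathcal{N} = \{\eta \in \mathcal{N};\ \eta = A_3(g, \tilde\eta) \text{ for some } \tilde\eta \in \mathcal{N}\}.$$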

Suppose that both $\gamma$ and $\mathcal{N}$ are preserved. We can then appeal to the 
invariance principle and focus on invariant statistics $\phi(Y)$ in which 
$\phi(A_1(g, Y)) = \phi(Y)$ for every $Y \in \mathcal{Y}$ and $g \in G$. Any invariant statistic 
can be written as a function of a maximal invariant statistic defined below. 

Definition 2.2. A statistic $M \equiv M(Y)$ is a maximal invariant in the 
sample space if 

$$M(\tilde Y) = M(Y) \text{ if and only if } \tilde Y = A_1(g, Y) \text{ for some } g \in G.$$

An orbit of $G$ is an equivalence class of elements $Y$, where $\tilde Y \sim Y$ (mod $G$) 
if there exists $g \in G$ such that $\tilde Y = A_1(g,Y)$. By definition, $M$ is a maximal 
invariant statistic if it is invariant and takes distinct values on different 
orbits of $G$. Every invariant procedure can be written as a function of a 
maximal invariant. Hence, we restrict our attention to the class of decision 
rules that depend only on the maximal invariant statistic. An analogous 
definition holds for the parameter space. 

Definition 2.3. A parameter $\theta \equiv \theta(\gamma, \eta)$ is a maximal invariant in the 
parameter space if $\theta(\gamma,\eta)$ is invariant and takes different values on different 
orbits of $G$: $O_{\gamma,\eta} = \{A_2(g, (\gamma, \eta)) \in \Gamma \times \mathcal{N};\ \text{for some } g \in G\}$. 

The distribution of a maximal invariant $M$ depends on $(\gamma, \eta)$ only through 
$\theta$. If $A_2 : G \times \Gamma \times \mathcal{N} \to \Gamma \times \mathcal{N}$ induces a group action $A_3 : G \times \mathcal{N} \to \mathcal{N}$, then 
$\theta = (\gamma, \lambda)$, where $\lambda \in \Lambda$ is the maximal invariant in the nuisance parameter 
space $\mathcal{N}$. The parameter set $\Lambda$ is allowed to be the empty set. 

Definition 2.4. Let $f(M;\theta)$ be the p.d.f./p.m.f. of a maximal invariant 
statistic (we shall refer to $f(M;\theta)$ as the invariant likelihood). The 
maximum invariant likelihood estimator (MILE) is defined as 

$$\hat\theta \equiv \arg\max_{\theta \in \Theta} f(M;\theta).$$

Comments. 1. Hereinafter, we assume the set $\Theta$ to be compact. 

2. The estimator $\hat\theta$ is the same for any one-to-one transformation of $M$. 
Different group actions $A_1(\cdot,Y)$ and $A_2(\cdot, (\gamma, \eta))$, however, yield different 
estimators. Hence, a better notation for $\hat\theta$ would indicate its dependence on 
the choice of group actions. 

3. In general, we seek group actions $A_1(\cdot,Y)$ and $A_2(\cdot, (\gamma, \eta))$ that preserve 
the model $\mathcal{P}$ and the structural parameter $\gamma$, and yield a maximal invariant 
$\lambda$ in $\mathcal{N}$ whose dimension stays fixed as the sample size grows. 

We introduce some additional notation. The superscript * indicates the 
true value of a parameter (e.g., $\gamma^*$ is the true value of the structural 
parameter $\gamma$). The subscript N denotes dependence on the sample size (e.g., $\lambda_N^*$ 
is the true value of the maximal invariant $\lambda$ when the sample size is N). In 
addition, let $1_T$ be a T-dimensional vector of ones, $0_{j \times k}$ be a $j \times k$ matrix 
with entries zero, and $e_j$ be a vector with entry j equal to one and other entries 
zero. 

Hereinafter, additional notation is specific to each example. 

3. Transformations within individuals. In this section, we present two 
examples of transformations within individuals. Instead of $P_{\gamma,\eta}$, we work 
with $P_{\gamma,\eta_i}$, the probability of the model for agent $i$. This clarifies our 
exposition and highlights the fact that the likelihood of each maximal invariant 
$M = (M_1, \ldots, M_N)$ is the product of marginal likelihoods. In all examples below, 
the maximal invariant in the parameter space is $\theta = \gamma$, with the objective 
function simplifying to 

$$Q_N(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ln f_i(M_i;\theta), \tag{3.1}$$

where $f_i(m_i; \theta)$ is the marginal density of the maximal invariant $M_i$ for each 
individual $i$. Because the MILE $\hat\theta_N$ maximizes $Q_N(\theta)$, consistency, asymptotic 
normality and optimality of $\hat\theta_N$ follow from standard results. 

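Lemma 3.1. Let $Q_N(\theta)$ be defined as in (3.1) and take all limits as 
$N \to \infty$. 

(a) Suppose that (i) $\sup_{\theta \in \Theta} |Q_N(\theta) - Q(\theta)| \to_p 0$ for a fixed, nonstochastic 
function $Q(\theta)$, and (ii) $\forall \varepsilon > 0$, $\sup_{\theta \notin B(\theta^*, \varepsilon)} Q(\theta) < Q(\theta^*)$. Then 

$$\hat\theta_N \to_p \theta^*.$$

(b) Suppose that (i) $\hat\theta_N \to_p \theta^*$, (ii) $\theta^* \in \mathrm{int}(\Theta)$, (iii) $Q_N(\theta)$ is twice 
continuously differentiable in some neighborhood of $\theta^*$, (iv) $\sqrt{N}\,\partial Q_N(\theta^*)/\partial\theta \to_d N(0, \mathcal{I}(\theta^*))$, 
and (v) $\sup_{\theta \in \Theta} |\partial^2 Q_N(\theta)/\partial\theta\,\partial\theta' + \mathcal{I}(\theta)| \to_p 0$ for some 
nonstochastic matrix $\mathcal{I}(\theta)$ that is continuous at $\theta^*$, where $\mathcal{I}(\theta^*)$ is nonsingular. Then 

$$\sqrt{N}(\hat\theta_N - \theta^*) \to_d N(0, \mathcal{I}(\theta^*)^{-1}).$$

(c) Suppose that (i) $\{Q_N(\theta);\ \theta \in \Theta\}$ is differentiable in quadratic mean 
at $\theta^*$ with nonsingular information matrix $\mathcal{I}(\theta^*)$, and (ii) 
$\sqrt{N}(\hat\theta_N - \theta^*) = \mathcal{I}(\theta^*)^{-1}\sqrt{N}\,\partial Q_N(\theta^*)/\partial\theta + o_{Q_N(\theta^*)}(1)$. Then 

$$\ln \frac{Q_N(\theta^* + h \cdot N^{-1/2})}{Q_N(\theta^*)} = h'S_N - \frac{1}{2}h'\mathcal{I}(\theta^*)h + o_{Q_N(\theta^*)}(1),$$

where $S_N \to_d N(0, \mathcal{I}(\theta^*))$ under $Q_N(\theta^*)$, and $\hat\theta_N$ is the best regular invariant 
estimator of $\theta^*$. 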

Comment. Part (a) assumes (i) uniform convergence of $Q_N(\theta)$ and (ii) 
unique identifiability of $\theta^*$. Under the assumption that $\Theta$ is compact, [7] 
show that $Q_N(\theta) \to_p Q(\theta)$ uniformly if and only if $Q_N(\theta) \to_p Q(\theta)$ pointwise 
and $Q_N(\theta) - Q(\theta)$ is stochastically equicontinuous. The nonstochastic 
function $Q(\theta)$ satisfies the unique identifiability condition if $\theta$ is identified 
and $Q(\theta)$ is continuous. 

3.1. A linear stationary panel data model. As an introductory example, 
consider a linear stationary panel data model with exogenous regressors and 
fixed effects: 

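$$y_{it} = \eta_i + x_{it}'\beta + u_{it},$$

where $y_{it} \in \mathbb{R}$ and $x_{it} \in \mathbb{R}^K$ are observable variables; $u_{it}$ are unobservable 
(possibly autocorrelated) errors, $i = 1, \ldots, N$, $t = 1, \ldots, T$; $\beta \in \mathbb{R}^K$ and $\sigma^2 \in \mathbb{R}$ 
are the structural parameters; and $\eta_i \in \mathbb{R}$ are incidental parameters, 
$i = 1, \ldots, N$. 

The model for $y_{i\cdot} = [y_{i1}, \ldots, y_{iT}]' \in \mathbb{R}^T$ conditional on 
$x_{i\cdot} = [x_{i1}, \ldots, x_{iT}]' \in \mathbb{R}^{T \times K}$ is 

$$y_{i\cdot} \sim N(\eta_i 1_T + x_{i\cdot}\beta,\ \sigma^2 \Sigma_T), \qquad \text{where } \Sigma_T = \frac{1}{1-\rho^2}\begin{bmatrix} 1 & \rho & \cdots & \rho^{T-1} \\ \rho & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \rho \\ \rho^{T-1} & \cdots & \rho & 1 \end{bmatrix}. \tag{3.2}$$

Both the model and the structural parameter $\gamma = (\beta, \sigma^2, \rho)$ are preserved 
by translations $g \cdot 1_T$ (where $g$ is a scalar), 

$$y_{i\cdot} + g \cdot 1_T \sim N((\eta_i + g)1_T + x_{i\cdot}\beta,\ \sigma^2 \Sigma_T).$$
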

Proposition 3.1. Let $g$ be elements of the real line with $g_1 \circ g_2 = g_1 + g_2$. 
If the actions on the sample and parameter spaces are, respectively, 
$A_1(g, y_{i\cdot}) = y_{i\cdot} + g \cdot 1_T$ and $A_2(g, (\beta, \sigma^2, \rho, \eta_i)) = (\beta, \sigma^2, \rho, \eta_i + g)$, then: 

(a) the vector $M_i = D y_{i\cdot}$ is a maximal invariant in the sample space, where 
$D$ is a $(T-1) \times T$ differencing matrix with typical row $(0, \ldots, 0, 1, -1, 0, \ldots, 0)$, 

(b) $\gamma$ is a maximal invariant in the parameter space, and 

(c) $M_i \sim N(D x_{i\cdot}\beta,\ \sigma^2 D\Sigma_T D')$ with density at $m_i = D y_{i\cdot}$ given by 

$$f_i(m_i; \beta, \rho, \sigma^2) = (2\pi\sigma^2)^{-(T-1)/2}\,|D\Sigma_T D'|^{-1/2} \exp\left\{-\frac{1}{2\sigma^2}(y_{i\cdot} - x_{i\cdot}\beta)'D'(D\Sigma_T D')^{-1}D(y_{i\cdot} - x_{i\cdot}\beta)\right\}.$$
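
The invariance behind part (a) of Proposition 3.1 can be checked numerically. The sketch below is only an illustration under assumed values: the function name `differencing_matrix`, the panel length T = 5, the random outcomes and the translation g = 3.7 are hypothetical choices, not quantities from the paper.

```python
import numpy as np

def differencing_matrix(T):
    """(T-1) x T matrix whose t-th row is (0, ..., 0, 1, -1, 0, ..., 0)."""
    D = np.zeros((T - 1, T))
    for t in range(T - 1):
        D[t, t] = 1.0
        D[t, t + 1] = -1.0
    return D

rng = np.random.default_rng(0)
T = 5
y_i = rng.normal(size=T)   # hypothetical outcomes for one individual
g = 3.7                    # an arbitrary translation (group element)

D = differencing_matrix(T)
M_original = D @ y_i                       # maximal invariant before the action
M_shifted = D @ (y_i + g * np.ones(T))     # maximal invariant after the action

# D annihilates the vector of ones, so the fixed effect drops out.
invariant = np.allclose(M_original, M_shifted)
```

Because every row of D sums to zero, the translation (and hence the fixed effect) never enters the differenced data.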

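Comment. Under regularity conditions (e.g., (i) $N^{-1}\sum_{i=1}^N \mathrm{vec}(x_{i\cdot})\mathrm{vec}(x_{i\cdot})' \to_p \Omega_{XX}$ p.d., 
(ii) $N^{-1/2}\sum_{i=1}^N u_{i\cdot} \otimes \mathrm{vec}(x_{i\cdot}) \to_d N(0, \sigma^{*2}\Sigma_T^* \otimes \Omega_{XX})$, where $u_{i\cdot} = [u_{i1}, \ldots, u_{iT}]'$, 
(iii) $\sup_{N \ge 1} N^{-1}\sum_{i=1}^N E\,\mathrm{vec}(x_{i\cdot})\mathrm{vec}(x_{i\cdot})' < \infty$, (iv) $(\beta, 1, 0) \in \Theta$, $\forall \beta$, 
and (v) $\theta^* \in \mathrm{int}(\Theta)$), we can use Lemma 3.1 to show that $\hat\theta_N$ is consistent 
and asymptotically normal. 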

3.2. A linear transformation model. Consider a simple panel data trans- 
formation model, 

$$\eta_i(y_{it}) = x_{it}'\beta + u_{it},$$

where $y_{it} \in \mathbb{R}$ and $x_{it} \in \mathbb{R}^K$ are observable variables; $u_{it} \in \mathbb{R}$ are 
unobservable errors, $i = 1, \ldots, N$, $t = 1, \ldots, T$, with $T > K$; $\eta_i : \mathbb{R} \to \mathbb{R}$ is an unknown, 
continuous, strictly increasing incidental function; and $\beta \in \mathbb{R}^K$ is the 
structural parameter. Unlike [2], we shall parameterize the distribution of the 
errors, $u_{it} \overset{iid}{\sim} N(\alpha_i, \sigma_i^2)$. Because of location and scale normalizations, we 
shall assume without loss of generality that $u_{it} \overset{iid}{\sim} N(0, 1)$. 
The model for $y_{i\cdot} = (y_{i1}, y_{i2}, \ldots, y_{iT}) \in \mathbb{R}^T$ is then given by 

$$P(y_{i\cdot} \le v) = \prod_{t=1}^{T} \Phi(\eta_i(v_t) - x_{it}'\beta), \qquad \text{where } v = [v_1, v_2, \ldots, v_T]'.$$
t=i 

Both the model and the structural parameter $\gamma = \beta$ are preserved by 
continuous, strictly increasing transformations. 

Proposition 3.2. Let $g$ be elements of the group of continuous, strictly 
increasing transformations, with $g_1 \circ g_2 = g_1(g_2)$. If the actions on the 
sample and parameter spaces are, respectively, 
$A_1(g, (y_{i1}, y_{i2}, \ldots, y_{iT})) = (g(y_{i1}), g(y_{i2}), \ldots, g(y_{iT}))$ and $A_2(g, (\beta, \eta_i)) = (\beta, \eta_i(g^{-1}))$, then: 

(a) the statistic $M_i = (M_{i1}, \ldots, M_{iT})$ is the maximal invariant in the 
sample space, where $M_{it}$ is the rank of $y_{it}$ in the collection $y_{i1}, \ldots, y_{iT}$, 

(b) the vector $\beta$ is the maximal invariant in the parameter space, and 

(c) $M_i$, $i = 1, \ldots, N$, are independent with marginal probability mass 
function of $M_i$ at $m_i$ given by 

$$f_i(m_{i1}, \ldots, m_{iT}; \beta) = \frac{1}{T!}\, E\left[\exp\left(\sum_{t=1}^{T} x_{it}'\beta\, V_{(m_{it})} - \frac{1}{2}\sum_{t=1}^{T} (x_{it}'\beta)^2\right)\right],$$

where $V_{(1)}, \ldots, V_{(T)}$ is an ordered sample from an $N(0, 1)$ distribution. 
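
Part (a) of Proposition 3.2 rests on the fact that ranks are unchanged by any continuous, strictly increasing transformation. A minimal sketch of this fact follows; the helper `ranks`, the data vector and the particular transformation g are hypothetical and chosen only for illustration (ties are assumed away).

```python
import numpy as np

def ranks(y):
    """Rank of each entry within the vector (1 = smallest), assuming no ties."""
    order = np.argsort(y)
    r = np.empty(len(y), dtype=int)
    r[order] = np.arange(1, len(y) + 1)
    return r

y_i = np.array([0.3, -1.2, 2.5, 0.9])   # hypothetical outcomes for one agent

def g(v):
    """A continuous, strictly increasing map: its derivative exp(v) + 3 v^2 is positive."""
    return np.exp(v) + v ** 3

M_original = ranks(y_i)        # ranks of the raw data
M_transformed = ranks(g(y_i))  # ranks after the agent-specific transformation
```

Since the ranks agree before and after the transformation, any statistic built from them does not depend on the unknown incidental function.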

The likelihood of the maximal invariant also yields semiparametric 
methods. For example, consider the case in which $T = 2$. If $x_{i2}'\beta > x_{i1}'\beta$, then it 
is likely that $y_{i2} > y_{i1}$. This yields the semiparametric estimator of [2]. This 
estimator maximizes 

$$Q_N(\beta) = \frac{1}{N}\sum_{i=1}^{N}\{H(y_{i2}, y_{i1})\,I(\Delta x_i'\beta > 0) + H(y_{i1}, y_{i2})\,I(\Delta x_i'\beta < 0)\},$$

where $H$ is an arbitrary function increasing in the first and decreasing in 
the second argument. This estimator is very appealing as it is consistent 
under more general error distributions. To obtain asymptotic normality, [2] proposes 
to smooth the objective function; the resulting 
convergence rate can be made arbitrarily close to $N^{-1/2}$. In contrast, the 
MILE estimator suggested here does not require arbitrary choices of $H$ or 
smoothing. 
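
To make the T = 2 objective concrete, the following sketch evaluates it on a toy data set. Everything here is an assumption made for illustration: the function name `rank_objective`, the data, and in particular the choice H(a, b) = 1{a > b}, which is only weakly monotone and is just one of many admissible candidates.

```python
import numpy as np

def rank_objective(beta, y1, y2, x1, x2, H=lambda a, b: (a > b).astype(float)):
    """Average of H(y2, y1) 1{dx'beta > 0} + H(y1, y2) 1{dx'beta < 0} over individuals."""
    dx_beta = (x2 - x1) @ beta
    return np.mean(H(y2, y1) * (dx_beta > 0) + H(y1, y2) * (dx_beta < 0))

# Toy panel with N = 4 individuals, K = 1 regressor and T = 2 periods.
x1 = np.zeros((4, 1))
x2 = np.array([[1.0], [1.0], [-1.0], [-1.0]])
y1 = np.zeros(4)
y2 = np.array([1.0, 2.0, -1.0, -3.0])   # outcomes move in the direction of x2 - x1

q_plus = rank_objective(np.array([1.0]), y1, y2, x1, x2)    # sign of beta agrees with the data
q_minus = rank_objective(np.array([-1.0]), y1, y2, x1, x2)  # sign of beta contradicts the data
```

With this indicator choice of H the objective only uses the sign of the index difference, which is why the estimator needs a scale normalization and, for asymptotic normality, some form of smoothing.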

4. An instrumental variables model. Consider a simple simultaneous 
equations model with two endogenous variables, multiple instrumental vari- 
ables (IVs) and errors that are normal with known covariance matrix. The 
model consists of a structural equation and a reduced-form equation: 

$$y_1 = y_2\beta + u,$$

$$y_2 = Z\pi + v_2,$$

where $y_1, y_2 \in \mathbb{R}^N$ and $Z \in \mathbb{R}^{N \times K}$ are observed variables; $u, v_2 \in \mathbb{R}^N$ are 
unobserved errors; and $\beta \in \mathbb{R}$ and $\pi \in \mathbb{R}^K$ are unknown parameters. The 
matrix $Z$ has full column rank $K$; the $N \times 2$ matrix of errors $[u : v_2]$ is 
assumed to be i.i.d. across rows with each row having a mean zero bivariate 
normal distribution with a nonsingular covariance matrix; $\pi$ is the incidental 
parameter; and $\beta$ is the parameter of interest. 

The two-equation reduced-form model can be written in matrix notation 
as 

$$Y = Z\pi a' + V,$$

where $Y = [y_1 : y_2]$, $V = [v_1 : v_2]$ and $a = (\beta, 1)'$. The distribution of $Y \in \mathbb{R}^{N \times 2}$ 
is multivariate normal with mean matrix $Z\pi a'$, independence across rows 
and covariance matrix $\Sigma$ for each row. 

Because the multivariate normal is a member of the exponential family 
of distributions, low-dimensional sufficient statistics are available for the 
parameter $(\beta, \pi')'$. Andrews, Moreira and Stock [8] and Chamberlain [12] 
propose using orthogonal transformations applied to the sufficient statistic 
$(Z'Z)^{-1/2}Z'Y$. The maximal invariant is $Y'N_Z Y$, where $N_Z = Z(Z'Z)^{-1}Z'$. 

We shall use an invariance argument without reducing the data to a 
sufficient statistic. For convenience, it is useful to write the model in a 
canonical form. The matrix $Z$ has the polar decomposition $Z = \omega(\rho', 0_{K \times (N-K)})'$, 
where $\omega$ is an $N \times N$ orthogonal matrix, and $\rho$ is the unique symmetric, 
positive definite square root of $Z'Z$. Define $R = \omega'Y$ and let $\eta = \rho\pi$. Then 
the canonical model is 

$$R = \begin{pmatrix} \eta \\ 0_{(N-K) \times 1} \end{pmatrix} a' + \omega'V.$$


Both the model and the structural parameters $\beta$ and $\Sigma$ are preserved by 
transformations $O(K)$ in the first $K$ rows of $R$. The next proposition obtains the 
maximal invariants in the sample and parameter spaces. 

Proposition 4.1. Let $g$ be elements of the orthogonal group of 
transformations $O(K)$ and partition the sample space $R = (R_1', R_2')'$, where $R_1$ 
is $K \times 2$ and $R_2$ is $(N - K) \times 2$. If the actions on the sample and 
parameter spaces are, respectively, $A_1(g, R) = ((gR_1)', R_2')'$ and $A_2(g, (\beta, \Sigma, \eta)) = 
(\beta, \Sigma, g\eta)$, then: 

(a) the maximal invariant in the sample space is $M = (R_1'R_1, R_2)$, and 

(b) the maximal invariant in the parameter space is $\theta_N = (\beta, \Sigma, \lambda_N)$, 
where $\lambda_N = \eta'\eta/N$. 
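
The algebra behind Proposition 4.1 can be checked numerically. The sketch below uses hypothetical dimensions and simulated placeholder data; it verifies that the first K rows of the canonical data give a cross-product equal to Y'N_Z Y and that this cross-product is unchanged by an orthogonal g. The variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 40, 3
Z = rng.normal(size=(N, K))   # hypothetical instruments with full column rank
Y = rng.normal(size=(N, 2))   # placeholder data; only the algebra matters here

# Inverse of the symmetric positive definite square root of Z'Z.
vals, vecs = np.linalg.eigh(Z.T @ Z)
root_inv = vecs @ np.diag(vals ** -0.5) @ vecs.T

R1 = root_inv @ Z.T @ Y                   # first K rows of the canonical data
N_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T)   # projection onto the columns of Z

# A random element g of O(K), taken from a QR factorisation.
g, _ = np.linalg.qr(rng.normal(size=(K, K)))

same_as_projection = np.allclose(R1.T @ R1, Y.T @ N_Z @ Y)
invariant_under_g = np.allclose((g @ R1).T @ (g @ R1), R1.T @ R1)
```

The same argument applied to the parameter space shows that the squared length of the rotated incidental vector is unchanged, which is why only a scalar summary of it survives.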

To illustrate the approach, we assume for simplicity that $\Sigma$ is known. 
Hence, we omit $\Sigma$ from now on [e.g., $\theta_N = (\beta, \lambda_N)$]. 

The density of $M$ is the product of the marginal densities of $R_1'R_1$ and 
$R_2$. Since $R_2$ is an ancillary statistic, we can focus on the marginal density of 
$R_1'R_1 = Y'N_Z Y$ in the maximization of the log-likelihood. As the density of 
$Y'N_Z Y$ is not well-behaved as $N$ goes to infinity, we work with the density 
of $W_N = N^{-1}Y'N_Z Y$ instead. 

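Theorem 4.1. The density of $W_N = N^{-1}Y'N_Z Y$ evaluated at $w$ is 

$$g(w; \beta, \lambda_N) = C_K^{-1}\, N^{K}\, |\Sigma|^{-K/2}\, |w|^{(K-3)/2} \exp\left(-\frac{N}{2}\left[\mathrm{tr}(\Sigma^{-1}w) + \lambda_N\, a'\Sigma^{-1}a\right]\right)$$
$$\times \left(N\sqrt{\lambda_N \cdot a'\Sigma^{-1}w\Sigma^{-1}a}\right)^{-(K-2)/2} I_{(K-2)/2}\left(N\sqrt{\lambda_N \cdot a'\Sigma^{-1}w\Sigma^{-1}a}\right), \tag{4.1}$$

where $C_K = 2^{(K+2)/2}\pi^{1/2}\Gamma\left(\frac{K-1}{2}\right)$, $I_\nu(\cdot)$ denotes the modified Bessel 
function of the first kind of order $\nu$, and $\Gamma(\cdot)$ is the gamma function. 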

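Define MILE as 

$$\hat\theta_N = \arg\max_{\theta \in \Theta} Q_N(\theta),$$

where $Q_N(\theta) = N^{-1}\ln g(W_N; \theta_N)$ and $\theta_N = (\beta, \lambda_N)$.¹ The next result shows 
that $\hat\theta_N = \theta_N^* + o_p(1)$ under general conditions. 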

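Theorem 4.2. (a) Under the assumption that $N \to \infty$ with $K$ fixed or 
$K/N \to 0$, (i) if $\lambda_N^*$ is fixed at $\lambda^* > 0$, then $\hat\theta_N \to_p \theta^* = (\beta^*, \lambda^*)$, (ii) if $\lambda_N^* \to_p 
\lambda^* > 0$, then $\hat\theta_N \to_p \theta^* = (\beta^*, \lambda^*)$ and (iii) if $0 < \liminf \lambda_N^* \le \limsup \lambda_N^* < 
\infty$, then $\hat\theta_N = \theta_N^* + o_p(1)$. 

(b) Under the assumption that $N \to \infty$ with $K/N \to \alpha > 0$, (i) if $\lambda_N^*$ is 
fixed at $\lambda^* > 0$, then $\hat\theta_N \to_p \theta^* = (\beta^*, \lambda^*)$, (ii) if $\lambda_N^* \to_p \lambda^* > 0$, then $\hat\theta_N \to_p 
\theta^* = (\beta^*, \lambda^*)$ and (iii) if $0 < \liminf \lambda_N^* \le \limsup \lambda_N^* < \infty$, then $\hat\theta_N = \theta_N^* + 
o_p(1)$, where $\theta_N^* = (\beta^*, \lambda_N^*)$. 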

Comments. 1. Parts (a), (b)(i) yield consistency results conditional on 
$\lambda_N^*$; the remaining results of the theorem are unconditional on $\lambda_N^*$. Parts (a), 
(b)(ii) yield consistency results for $\beta^*$ under SIV and MWIV asymptotics 
when $\lambda_N^* \to_p \lambda^*$. The assumption of $\lambda_N^* \to_p \lambda^*$ is standard in the literature, 
but parts (a), (b)(iii) show that $\hat\beta_N \to_p \beta^*$ without imposing convergence of 
$\lambda_N^*$. 

2. This result also holds under nonnormal errors, as long as $V(W_N) \to 0$. 

Proposition 4.2. MILE of P is the limited information maximum like- 
lihood (LIMLK) estimator. 

Proposition 4.2 together with Theorem 4.2 explain why the LIMLK esti- 
mator is consistent when the number of instruments increases. The MILE 
estimator maximizes a log-likelihood function that is well-behaved as it 
depends on a finite number of parameters. The LIMLK estimator is consistent 
because it coincides with MILE. 
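
For readers who want to compute the estimator, the sketch below implements LIML with known covariance matrix under the usual characterisation, assumed here, that it minimises the variance ratio b'Y'N_Z Y b / b'Σb over b = (1, -β)'. The function name `limlk` and the noiseless test design are hypothetical; the minimisation is carried out through a symmetric eigenproblem after a Cholesky change of variables.

```python
import numpy as np

def limlk(Y, Z, Sigma):
    """LIML with known Sigma: minimise b'Y'N_Z Y b / b'Sigma b over b = (1, -beta)'."""
    NzY = Z @ np.linalg.solve(Z.T @ Z, Z.T @ Y)       # projection of Y on the columns of Z
    A = Y.T @ NzY                                      # the 2 x 2 matrix Y' N_Z Y
    L = np.linalg.cholesky(Sigma)                      # Sigma = L L'
    L_inv = np.linalg.inv(L)
    vals, vecs = np.linalg.eigh(L_inv @ A @ L_inv.T)   # eigenvalues in ascending order
    b = L_inv.T @ vecs[:, 0]                           # direction attaining the smallest ratio
    return -b[1] / b[0]                                # normalise the first entry of b to one

# Noiseless check: with V = 0 the ratio is exactly zero at the true beta.
rng = np.random.default_rng(2)
N, K, beta_true = 50, 3, 0.7
Z = rng.normal(size=(N, K))
pi = np.array([1.0, 0.5, -0.2])
Y = np.outer(Z @ pi, np.array([beta_true, 1.0]))       # Y = Z pi a' with a = (beta, 1)'
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
beta_hat = limlk(Y, Z, Sigma)
```

Only the 2 x 2 matrix Y'N_Z Y enters the computation, so the cost of the final step does not grow with the number of instruments.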



¹The objective function $Q_N(\theta)$ is not defined if $W_N$ is not positive definite (due to the 
term $\ln|W_N|$). To avoid this technical issue, we can instead maximize only the terms of 
$Q_N(\theta)$ that depend on $\theta$. 
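Theorem 4.3. Let the score statistic and the Hessian matrix be 

$$S_N(\theta) = \frac{\partial Q_N(\theta)}{\partial\theta} \quad \text{and} \quad H_N(\theta) = \frac{\partial^2 Q_N(\theta)}{\partial\theta\,\partial\theta'},$$

respectively, and define the matrix 

$$\mathcal{I}_\alpha(\theta^*) = \begin{bmatrix} \lambda^{*2}\,\dfrac{a^{*\prime}\Sigma^{-1}a^* \cdot e_1'\Sigma^{-1}e_1\,(\alpha + 2\lambda^* a^{*\prime}\Sigma^{-1}a^*) + \alpha\,(a^{*\prime}\Sigma^{-1}e_1)^2}{(\alpha + \lambda^* a^{*\prime}\Sigma^{-1}a^*)(\alpha + 2\lambda^* a^{*\prime}\Sigma^{-1}a^*)} & \lambda^*\,\dfrac{a^{*\prime}\Sigma^{-1}a^* \cdot a^{*\prime}\Sigma^{-1}e_1}{\alpha + 2\lambda^* a^{*\prime}\Sigma^{-1}a^*} \\[2ex] \lambda^*\,\dfrac{a^{*\prime}\Sigma^{-1}a^* \cdot a^{*\prime}\Sigma^{-1}e_1}{\alpha + 2\lambda^* a^{*\prime}\Sigma^{-1}a^*} & \dfrac{(a^{*\prime}\Sigma^{-1}a^*)^2}{2(\alpha + 2\lambda^* a^{*\prime}\Sigma^{-1}a^*)} \end{bmatrix}.$$

(a) Suppose that $\lambda_N^*$ is fixed at $\lambda^* > 0$ and $N \to \infty$ with $K$ fixed. Then 
(i) $\sqrt{N}S_N(\theta^*) \to_d N(0, \mathcal{I}_0(\theta^*))$, (ii) $H_N(\theta^*) \to_p -\mathcal{I}_0(\theta^*)$, and (iii) $\sqrt{N}(\hat\theta_N - 
\theta^*) \to_d N(0, \mathcal{I}_0(\theta^*)^{-1})$. 

(b) Suppose that $\lambda_N^*$ is fixed at $\lambda^* > 0$ and $N \to \infty$ with $K/N \to \alpha$. Then 
(i) $\sqrt{N}S_N(\theta^*) \to_d N(0, \mathcal{I}_\alpha(\theta^*))$, (ii) $H_N(\theta^*) \to_p -\mathcal{I}_\alpha(\theta^*)$ and (iii) $\sqrt{N}(\hat\theta_N - 
\theta^*) \to_d N(0, \mathcal{I}_\alpha(\theta^*)^{-1})$. 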

Comment. For convenience, we provide asymptotic results only for the 
case in which $\lambda_N^*$ is fixed at $\lambda^* > 0$. Small changes in the proofs also yield 
asymptotic results for $\lambda_N^* \to_p \lambda^*$. 

As a corollary, we find the limiting distribution of LIMLK. This result 
coincides with those obtained by [10] . 

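Corollary 4.1. Define $\sigma_u^2 = b'\Sigma b$, where $b = (1, -\beta^*)'$. Under SIV asymptotics (or under 
MWIV asymptotics with $\alpha = 0$), conditional on $\lambda_N^* = \lambda^* > 0$, 

$$\sqrt{N}(\hat\beta_N - \beta^*) \to_d N\left(0, \frac{\sigma_u^2}{\lambda^*}\right). \tag{4.2}$$

Under MWIV asymptotics, conditional on $\lambda_N^* = \lambda^* > 0$, 

$$\sqrt{N}(\hat\beta_N - \beta^*) \to_d N\left(0, \frac{\sigma_u^2}{\lambda^*}\left\{1 + \frac{\alpha\,|\Sigma|}{\lambda^*\sigma_u^2}\right\}\right). \tag{4.3}$$

Comments. 1. The limiting distribution given in (4.3) simplifies to the one 
given in (4.2) as $\alpha \to 0$. 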

2. Instead of using the invariant likelihood, we could use only its first 
moment to obtain a minimum distance (MD) estimator. Define 

$$m(W_N; \theta_N) = \mathrm{vech}(W_N) - \mathrm{vech}\left(aa' \cdot \lambda_N + \frac{K}{N}\Sigma\right). \tag{4.4}$$

If $\lambda_N^* > 0$, then the following holds (for possibly nonnormal errors): 

$$E_{\theta_N^*}(m(W_N; \theta_N)) = 0 \text{ if and only if } \theta_N = \theta_N^*. \tag{4.5}$$

Because the number of moment conditions does not increase under SIV or 
MWIV asymptotics, we can show that the MD estimator based on (4.4) and 
(4.5) is consistent and asymptotically normal. 
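
A rough numerical illustration of the first-moment idea follows. It relies on the fact that, in the canonical model, the mean of W_N is λ_N aa' + (K/N)Σ with a = (β, 1)'. The function `md_rank_one` is a hypothetical, unweighted rank-one fit through the leading eigenpair; it is a sketch of one possible moment-based estimator and is not claimed to be the MD estimator studied in the paper.

```python
import numpy as np

def md_rank_one(W, Sigma, K, N):
    """Fit W - (K/N) Sigma by lambda * a a' with a = (beta, 1)'; return (beta, lambda)."""
    M = W - (K / N) * Sigma
    vals, vecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    mu, v = vals[-1], vecs[:, -1]    # leading eigenpair, so that M is close to mu * v v'
    beta = v[0] / v[1]               # normalise a so that its second entry equals one
    lam = mu * v[1] ** 2
    return beta, lam

Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
beta_true, lam_true, K, N = 0.7, 2.0, 5, 100
a = np.array([beta_true, 1.0])

# Population first moment of W_N at the true parameter values.
W_population = lam_true * np.outer(a, a) + (K / N) * Sigma
beta_hat, lam_hat = md_rank_one(W_population, Sigma, K, N)
```

Applied to the population moment the fit recovers the parameters exactly; with a sample W_N it would give a simple, generally inefficient, starting value.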

Finally, we obtain the following result under SIV and MWIV asymptotics 
in our setup. 

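Theorem 4.4. Define the log-likelihood ratio 

$$\Lambda_N(\theta^* + h \cdot N^{-1/2}, \theta^*) = N\left(Q_N(\theta^* + h \cdot N^{-1/2}) - Q_N(\theta^*)\right).$$

(a) Under SIV asymptotics, 

$$\Lambda_N(\theta^* + h \cdot N^{-1/2}, \theta^*) = h'\sqrt{N}S_N(\theta^*) - \frac{1}{2}h'\mathcal{I}_0(\theta^*)h + o_{Q_N(\theta^*)}(1), \tag{4.6}$$

where $\sqrt{N}S_N(\theta^*) \to_d N(0, \mathcal{I}_0(\theta^*))$ under $Q_N(\theta^*)$. 

(b) Under MWIV asymptotics, 

$$\Lambda_N(\theta^* + h \cdot N^{-1/2}, \theta^*) = h'\sqrt{N}S_N(\theta^*) - \frac{1}{2}h'\mathcal{I}_\alpha(\theta^*)h + o_{Q_N(\theta^*)}(1), \tag{4.7}$$

where $\sqrt{N}S_N(\theta^*) \to_d N(0, \mathcal{I}_\alpha(\theta^*))$ under $Q_N(\theta^*)$. 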

Furthermore, the LIMLK estimator is asymptotically efficient within the 
class of regular invariant estimators under both SIV and MWIV asymptotics. 

Comments. 1. The proof of [14] uses asymptotic results by [19] for Wishart 
distributions. The standard literature on limit of experiments instead 
typically provides expansions around the score (e.g., [27]). Theorem 4.3 shows 
that the score is asymptotically normal with variance given by the negative 
of the limit of the Hessian matrix. As the remainder terms are 
asymptotically negligible, (4.6) and (4.7) hold true. 

2. Theorem 4.4 requires the assumption of normal errors. Anderson, 
Kunitomo and Matsushita [6] exploit the fact that $W_N$ involves double sums 
(in terms of N and K) to obtain optimality results for nonnormal errors. 

Under MWIV asymptotics, the LIMLK estimator achieves the bound 
$(\mathcal{I}_\alpha(\theta^*)^{-1})_{11}$. Under SIV asymptotics, the bound $(\mathcal{I}_0(\theta^*)^{-1})_{11}$ for regular 
invariant estimators of $\beta$ is the same as the one achieved by limit of 
experiments applied to the likelihood of $Y$. Hence, there is no loss of efficiency in 
focusing on the class of invariant procedures under SIV asymptotics. 
5. A nonstationary dynamic panel data model. Consider a simple 
dynamic panel data model with fixed effects, 

$$y_{i,t} = \rho\, y_{i,t-1} + \eta_i + u_{i,t},$$

where $y_{i,t} \in \mathbb{R}$ are observable variables and $u_{i,t} \overset{iid}{\sim} N(0, \sigma^2)$ are 
unobservable errors, $i = 1, \ldots, N$, $t = 1, \ldots, T$; $\eta_i \in \mathbb{R}$ are incidental parameters, 
$i = 1, \ldots, N$; $\gamma = (\rho, \sigma^2) \in \mathbb{R} \times \mathbb{R}$ are structural parameters; and $y_{i,0}$ are the 
initial values of the stochastic process. We seek inference conditional on the 
initial values $y_{i,0} = 0$.² 

In its matrix form, we have 

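$$[y_{\cdot 1}, y_{\cdot 2}, \ldots, y_{\cdot T}] = \rho\,[y_{\cdot 0}, y_{\cdot 1}, \ldots, y_{\cdot T-1}] + \eta 1_T' + [u_{\cdot 1}, u_{\cdot 2}, \ldots, u_{\cdot T}], \tag{5.1}$$

where $y_{\cdot t} = [y_{1,t}, y_{2,t}, \ldots, y_{N,t}]' \in \mathbb{R}^N$, $u_{\cdot t} = [u_{1,t}, u_{2,t}, \ldots, u_{N,t}]' \in \mathbb{R}^N$, and 
$\eta = [\eta_1, \ldots, \eta_N]' \in \mathbb{R}^N$. Solving (5.1) recursively yields 

$$[y_{\cdot 1}, y_{\cdot 2}, \ldots, y_{\cdot T}] = \eta\,(B1_T)' + [u_{\cdot 1}, u_{\cdot 2}, \ldots, u_{\cdot T}]\,B', \qquad \text{where } B = \begin{bmatrix} 1 & & & \\ \rho & 1 & & \\ \vdots & \ddots & \ddots & \\ \rho^{T-1} & \cdots & \rho & 1 \end{bmatrix}. \tag{5.2}$$

The inverse of $B$ has a simple form: 

$$B^{-1} \equiv D = I_T - \rho \cdot J_T, \qquad \text{where } J_T = \begin{bmatrix} 0_{T-1}' & 0 \\ I_{T-1} & 0_{T-1} \end{bmatrix}$$

and $0_{T-1}$ is a $(T-1)$-dimensional column vector with zero entries. 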
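
The claim that the inverse of B equals the identity minus ρ times a one-step shift matrix is easy to verify numerically. The sketch below uses hypothetical values T = 6 and ρ = 0.5, and the helper names `B_matrix` and `D_matrix` are assumptions made for illustration.

```python
import numpy as np

def B_matrix(rho, T):
    """Lower triangular T x T matrix with (i, j) entry rho**(i - j) for i >= j."""
    idx = np.arange(T)
    lag = idx[:, None] - idx[None, :]
    return np.where(lag >= 0, rho ** np.abs(lag), 0.0)

def D_matrix(rho, T):
    """Identity minus rho times the matrix with ones on the first subdiagonal."""
    J = np.eye(T, k=-1)
    return np.eye(T) - rho * J

T, rho = 6, 0.5
B = B_matrix(rho, T)
D = D_matrix(rho, T)
is_inverse = np.allclose(D @ B, np.eye(T))
```

Each row of the product subtracts ρ times the previous row of B from the current one, which cancels every entry below the diagonal.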

If individuals i are treated equally, the coordinate system used to specify 
the vectors y.t should not affect inference based on them. In consequence, it 
is reasonable to restrict attention to coordinate-free functions of $y_{\cdot t}$. Indeed, 
we find that orthogonal transformations preserve both the model given in 
(5.2) and the structural parameter $\gamma = (\rho, \sigma^2)$. 



Proposition 5.1. Let $g$ be elements of the orthogonal group of 
transformations $O(N)$. If the actions on the sample and parameter spaces are, 
respectively, $A_1(g, Y) = gY$ and $A_2(g, (\rho, \sigma^2, \eta)) = (\rho, \sigma^2, g\eta)$, then: 

(a) the maximal invariant in the sample space is $M = Y'Y$, and 

(b) the maximal invariant in the parameter space is $\theta_N = (\gamma, \lambda_N)$, where 
$\lambda_N = \eta'\eta/(N\sigma^2)$. 
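
The following sketch simulates a small panel from the dynamic model with zero initial values and checks that the cross-product of the data and the squared length of the fixed effects are unchanged when the individuals are rotated by an orthogonal matrix. The function `simulate_panel` and all numerical values are hypothetical choices for illustration.

```python
import numpy as np

def simulate_panel(rho, sigma2, eta, T, rng):
    """Generate the N x T data matrix of the dynamic panel model with y_{i,0} = 0."""
    N = len(eta)
    Y = np.zeros((N, T))
    y_prev = np.zeros(N)
    for t in range(T):
        y_prev = rho * y_prev + eta + rng.normal(scale=np.sqrt(sigma2), size=N)
        Y[:, t] = y_prev
    return Y

rng = np.random.default_rng(3)
N, T = 8, 4
eta = rng.normal(scale=2.0, size=N)   # hypothetical fixed effects
Y = simulate_panel(0.5, 1.0, eta, T, rng)

# A random element g of O(N), taken from a QR factorisation.
g, _ = np.linalg.qr(rng.normal(size=(N, N)))

sample_invariant = np.allclose((g @ Y).T @ (g @ Y), Y.T @ Y)
parameter_invariant = np.isclose((g @ eta) @ (g @ eta), eta @ eta)
```

The T x T matrix Y'Y has a dimension that does not grow with N, which is what makes the invariant likelihood tractable when N is large.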

²We can assume that $y_{i,0} = 0$ by writing the model as 

$$(y_{i,t} - y_{i,0}) = \rho\,(y_{i,t-1} - y_{i,0}) + (\eta_i - y_{i,0}(1 - \rho)) + u_{i,t};$$

see, for example, [25]. 
Comment. If there is autocorrelation $\Sigma_T$ that is homogeneous across 
individuals, the maximal invariant $M$ remains the same. The covariance matrix, 
however, changes to $\Sigma = B\Sigma_T B'$. 

For convenience, we standardize the distribution of M = Y'Y. 
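Theorem 5.1. If $N \ge T$, the density of $W_N = N^{-1}Y'Y$ at $w$ is 

$$g(w; \rho, \sigma^2, \lambda_N) = C_{N,T}^{-1}\, N^{NT/2}\, (\sigma^2)^{-NT/2}\, |w|^{(N-T-1)/2}$$
$$\times \exp\left(-\frac{N}{2\sigma^2}\mathrm{tr}(DwD')\right)\exp\left(-\frac{NT}{2}\lambda_N\right)$$
$$\times \left(N\sqrt{\lambda_N \cdot \frac{1_T'DwD'1_T}{\sigma^2}}\right)^{-(N-2)/2} I_{(N-2)/2}\left(N\sqrt{\lambda_N \cdot \frac{1_T'DwD'1_T}{\sigma^2}}\right), \tag{5.3}$$

where $C_{N,T} = 2^{NT/2 - (N-2)/2}\,\pi^{T(T-1)/4}\prod_{t=1}^{T-1}\Gamma\left(\frac{N-t}{2}\right)$. 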
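Define MILE as 

$$\hat\theta_N = \arg\max_{\theta \in \Theta} Q_N(\theta),$$

where $Q_N(\theta) = (NT)^{-1}\ln g(W_N; \rho, \sigma^2, \lambda)$ and $\theta_N = (\rho, \sigma^2, \lambda_N)$.³ The next 
result shows that $\hat\theta_N = \theta_N^* + o_p(1)$ under general conditions. 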

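Theorem 5.2. (a) Under the assumption that $N \to \infty$ with $T$ fixed, (i) 
if $\lambda_N^*$ is fixed at $\lambda^*$, then $\hat\theta_N \to_p \theta^* = (\rho^*, \sigma^{*2}, \lambda^*)$, (ii) if $\lambda_N^* \to_p \lambda^*$, then 
$\hat\theta_N \to_p \theta^* = (\rho^*, \sigma^{*2}, \lambda^*)$ and (iii) if $\limsup \lambda_N^* < \infty$, then $\hat\theta_N = \theta_N^* + o_p(1)$, 
where $\theta_N^* = (\rho^*, \sigma^{*2}, \lambda_N^*)$. 

(b) Under the assumption that $T \to \infty$ and $|\rho^*| < 1$, (i) if $\lambda_N^*$ is fixed at $\lambda^*$, 
then $\hat\theta_N \to_p \theta^* = (\rho^*, \sigma^{*2}, \lambda^*)$, (ii) if $\lambda_N^* \to_p \lambda^*$, then $\hat\theta_N \to_p \theta^* = (\rho^*, \sigma^{*2}, \lambda^*)$ 
and (iii) if $\limsup \lambda_N^* < \infty$, then $\hat\theta_N = \theta_N^* + o_p(1)$, where $\theta_N^* = (\rho^*, \sigma^{*2}, \lambda_N^*)$. 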

Comments. 1. This result also holds under nonnormal errors. 
2. This theorem implies that $\hat\rho_N \to_p \rho^*$ under the assumption that $NT \to 
\infty$ (regardless of the growth rates of $N$ and $T$). 

The next result derives the limiting distribution of MILE when $N \to \infty$. 



³If $N < T$, $W_N$ is not absolutely continuous with respect to the Lebesgue measure. We 
will still maximize the pseudo-likelihood to find $\hat\theta_N$. 
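Theorem 5.3. Suppose that $\sigma^{*2} > 0$ and $\lambda_N^*$ is fixed at $\lambda^* > 0$, and let 
the score statistic and the Hessian matrix be 

$$S_N(\theta) = \frac{\partial Q_N(\theta)}{\partial\theta} \quad \text{and} \quad H_N(\theta) = \frac{\partial^2 Q_N(\theta)}{\partial\theta\,\partial\theta'},$$

respectively, and define the symmetric matrix 

$$\mathcal{I}_T(\theta^*) = \begin{bmatrix} h_{1,T} + h_{2,T} + h_{3,T} & \dfrac{\lambda^*}{2\sigma^{*2}}\,\dfrac{2\lambda^* T}{1 + 2\lambda^* T}\,\dfrac{1_T'F1_T}{T} & \dfrac{1 + \lambda^* T}{1 + 2\lambda^* T}\,\dfrac{1_T'F1_T}{T} \\[2ex] \cdot & \dfrac{1}{2(\sigma^{*2})^2} + \dfrac{\lambda^*}{4(\sigma^{*2})^2}\,\dfrac{2\lambda^* T}{1 + 2\lambda^* T} & \dfrac{1}{4\sigma^{*2}}\,\dfrac{2(1 + \lambda^* T)}{1 + 2\lambda^* T} \\[2ex] \cdot & \cdot & \dfrac{1}{4\lambda^*}\,\dfrac{2\lambda^* T}{1 + 2\lambda^* T} \end{bmatrix},$$

where $DB^* = I_T + (\rho^* - \rho)F$ and the three terms in the (1, 1) entry of $\mathcal{I}_T(\theta^*)$ 
are 

$$h_{1,T} = \frac{\mathrm{tr}(FF')}{T}, \qquad h_{2,T} = \lambda^*\,\frac{1_T'F'F1_T}{T} + \frac{2\lambda^{*2}}{1 + 2\lambda^* T}\,\frac{(1_T'F1_T)^2}{T}$$

and 

$$h_{3,T} = -\frac{\lambda^*}{1 + \lambda^* T}\left(\frac{1_T'FF'1_T}{T} + \lambda^*\,\frac{(1_T'F1_T)^2}{T}\right).$$

As $N \to \infty$ with $T$ fixed, 

(a) (i) $\sqrt{NT}\,S_N(\theta^*) \to_d N(0, \mathcal{I}_T(\theta^*))$, (ii) $H_N(\theta^*) \to_p -\mathcal{I}_T(\theta^*)$ and (iii) 
$\sqrt{NT}(\hat\theta_N - \theta^*) \to_d N(0, \mathcal{I}_T(\theta^*)^{-1})$, and 

(b) the log-likelihood ratio is 

$$\Lambda_N(\theta^* + h \cdot (NT)^{-1/2}, \theta^*) = NT\left\{Q_N(\theta^* + h \cdot (NT)^{-1/2}) - Q_N(\theta^*)\right\} = h'\sqrt{NT}\,S_N(\theta^*) - \frac{1}{2}h'\mathcal{I}_T(\theta^*)h + o_{Q_N(\theta^*)}(1), \tag{5.4}$$

where $\sqrt{NT}\,S_N(\theta^*) \to_d N(0, \mathcal{I}_T(\theta^*))$ under $Q_N(\theta^*)$. Furthermore, $\hat\theta_N$ is 
asymptotically efficient within the class of regular invariant estimators under large 
$N$, fixed $T$ asymptotics. 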

Comments. 1. It is possible to extend parts (a)(i), (iii) to nonnormal errors 
by finding the appropriate asymptotic distribution of $\sqrt{NT}\,S_N(\theta^*)$. 

2. The MILE estimator $\hat\rho_N$ achieves the bound $(\mathcal{I}_T(\theta^*)^{-1})_{11}$ as $N \to \infty$, 
whereas the bias-corrected OLS estimator does not. 

3. Instead of using the invariant likelihood to obtain an estimator, we 
could instead use only its first moment. Let $W_i = y_{i\cdot}y_{i\cdot}'$, where 
$y_{i\cdot} = [y_{i,1}, y_{i,2}, \ldots, y_{i,T}]' \in \mathbb{R}^T$, so that $W_N = N^{-1}\sum_{i=1}^N W_i$, and define 

$$m(W_N; \theta_N) = \mathrm{vech}(W_N) - \sigma^2\,\mathrm{vech}\left(B\{I_T + \lambda_N \cdot 1_T 1_T'\}B'\right). \tag{5.5}$$

Then the following holds: 

$$E_{\theta_N^*}(m(W_N; \theta_N)) = 0 \text{ if and only if } \theta_N = \theta_N^*. \tag{5.6}$$

In the IV model, the number of moment conditions does not increase 
with N or K (see comment 2 to Corollary 4.1). In the panel data model, 
the number T(T + l)/2 of moment conditions given in (5.6) increases (too 
quickly) with $T$. Therefore, consistency and semiparametric efficiency 
results (e.g., [3] and [34]) do not apply to (5.6) as $T \to \infty$. Instead, Hahn and 
Kuersteiner [17] cleverly use Hajek's convolution theorem to obtain an 
efficiency bound for normal errors as $T \to \infty$ for the stationary case $|\rho^*| < 1$. 
The bias-corrected OLS estimator of $\rho$ achieves the bound of [17] for large $N$, large 
$T$ asymptotics. 

Our efficiency bound $(\mathcal{I}_T(\theta^*)^{-1})_{11}$ reduces to the bound of [17] when $T \to \infty$. 
This shows that there is no loss of efficiency in focusing on the class of 
invariant procedures under large N, large T asymptotics. 

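Corollary 5.1. Under the assumption that $|\rho^*| < 1$, the efficiency bound 
given by the (1, 1) coordinate of $\mathcal{I}_T(\theta^*)^{-1}$ converges to the (1, 1) coordinate of 
$\mathcal{I}_\infty(\theta^*)^{-1} \equiv (\lim_{T \to \infty}\mathcal{I}_T(\theta^*))^{-1}$, which equals the efficiency bound 
$(1 - \rho^{*2})$ of [17], as $T \to \infty$. 

As a final result, the MILE estimator $\hat\rho_N$ also achieves the bound $(\mathcal{I}_\infty(\theta^*)^{-1})_{11}$ 
for large $N$, large $T$ asymptotics. 

Theorem 5.4. Under the assumption that $N \ge T \to \infty$, $|\rho^*| < 1$, and 
$\lambda_N^*$ is fixed at $\lambda^* > 0$, (i) $\sqrt{NT}\,S_N(\theta^*) \to_d N(0, \mathcal{I}_\infty(\theta^*))$, (ii) $H_N(\theta^*) \to_p 
-\mathcal{I}_\infty(\theta^*)$ and (iii) $\sqrt{NT}(\hat\theta_N - \theta^*) \to_d N(0, \mathcal{I}_\infty(\theta^*)^{-1})$. 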

6. Numerical results. This section illustrates the MILE approach for estimation of the autoregressive parameter $\rho$ in the dynamic panel data model described in Section 5. The numerical results are presented as means and mean squared errors (MSEs) based on 1000 Monte Carlo simulations. These results are also available for other fixed-effects estimators: Arellano-Bond (AB), Ahn-Schmidt (AS) and bias-corrected OLS (BCOLS) estimators.
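The two summary statistics reported in the tables can be computed from a list of simulated estimates as in the sketch below. The function name `summarize` is illustrative, and the four estimates in the usage line are made-up numbers, not simulation output from the paper.

```python
from statistics import fmean


def summarize(estimates, true_value):
    """Return (mean, bias, mse) of Monte Carlo estimates of a scalar parameter."""
    mean = fmean(estimates)
    bias = mean - true_value
    # MSE is measured around the true parameter value, not around the sample mean.
    mse = fmean([(estimate - true_value) ** 2 for estimate in estimates])
    return mean, bias, mse


# Toy usage with four hypothetical replications of an estimator of rho* = 0.5.
mean, bias, mse = summarize([0.4, 0.6, 0.5, 0.5], true_value=0.5)
```

In an actual replication the list would hold one estimate per Monte Carlo draw (1000 in the paper's design).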

We consider different combinations of short and long panels: N = 5, 10, 25, 100 and T = 2, 3, 5, 10, 25, 100.

Table 1 presents the initial design from which several variations are drawn.* This design assumes that $\eta_i^* \sim N(0, 4)$ (random effects), $u_{it} \sim N(0, 1)$ (normal errors) and $\rho^* = 0.5$ (positive autocorrelation). The value $\sigma^*$ is fixed at one for all designs.
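A minimal sketch of how one data set from this baseline design could be generated is given below. It assumes the recursion $y_{it} = \rho y_{i,t-1} + \eta_i + \sigma u_{it}$ started at $y_{i0} = 0$; the model itself is defined in Section 5, so both the recursion and the initial condition are assumptions here, and the function names are illustrative.

```python
import random


def ar1_path(rho, eta, shocks, sigma=1.0, y0=0.0):
    """Path y_1..y_T of y_t = rho*y_{t-1} + eta + sigma*u_t (y0 is an assumed start)."""
    path, previous = [], y0
    for shock in shocks:
        previous = rho * previous + eta + sigma * shock
        path.append(previous)
    return path


def simulate_panel(n_units, n_periods, rho=0.5, sigma=1.0, eta_sd=2.0, seed=0):
    """Draw an N x T panel with eta_i ~ N(0, eta_sd^2) and u_it ~ N(0, 1)."""
    rng = random.Random(seed)
    panel = []
    for _ in range(n_units):
        eta = rng.gauss(0.0, eta_sd)  # eta_sd = 2 gives variance 4, as in Table 1
        shocks = [rng.gauss(0.0, 1.0) for _ in range(n_periods)]
        panel.append(ar1_path(rho, eta, shocks, sigma))
    return panel


panel = simulate_panel(n_units=5, n_periods=3)
```

Each Monte Carlo replication would call `simulate_panel` with a fresh seed and pass the result to the estimator under study.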



*The full set of results for $\rho$, $\sigma^2$ and $\lambda_N$ using different designs is available at http://www.columbia.edu/~mm3534/.
MILE seems to be correctly centered around 0.5. Even in a very short panel with N = 5 and T = 2, its bias of 0.0408 is quite small. As N and/or T increases, its mean approaches 0.5. For example, for N = 5 and T = 25, the bias is around 0.0129; for N = 25 and T = 2, the bias is around 0.0040. These numerical results support the theoretical finding that MILE is consistent as long as NT goes to infinity (regardless of the relative rate of N and T). The BCOLS estimator seems to have smaller bias than the AB and AS estimators for small N and large T. The AB and AS estimators have large bias with small N and T, but their performance improves with large N and small T.

MILE also seems to have smaller MSE than the other estimators. The AS estimator outperforms the AB estimator in terms of MSE. The BCOLS estimator has smaller MSE than AS. The MSE of the BCOLS estimator, however, does not decrease if N increases but T is held constant. For T ≥ 25, its performance is comparable to that of MILE. This provides numerical support for the theoretical finding that both MILE and BCOLS reach our large N, large T bound.

Table 1
Performance of estimators for the autoregressive parameter ρ (random effects, normal errors, and ρ = 0.50)

          Mean                                MSE
  T    N     MILE    BCOLS       AB       AS    MILE   BCOLS         AB      AS
  2    5   0.4592   0.9651        *        *  0.1552  0.4602          *       *
  2   10   0.4859   0.9500        *        *  0.0631  0.3109          *       *
  2   25   0.4960   0.9523        *        *  0.0246  0.2394          *       *
  2  100   0.4974   0.9474        *        *  0.0054  0.2083          *       *
  3    5   0.4431   0.7695  -0.0578   0.8642  0.0631  0.1607   516.8489  0.3823
  3   10   0.4789   0.7903   0.9766   0.8954  0.0280  0.1165   153.1105  0.2559
  3   25   0.4908   0.8008   0.5705   0.9389  0.0115  0.1045     4.7087  0.2219
  3  100   0.4979   0.8068   0.5372   0.9632  0.0024  0.0975     0.0724  0.2204
  5    5   0.4626   0.6469   0.1980   0.6541  0.0231  0.0538     0.2323  0.0991
  5   10   0.4802   0.6657   0.2386   0.7162  0.0116  0.0422     0.2145  0.0820
  5   25   0.4935   0.6702   0.3768   0.7940  0.0044  0.0347     0.0869  0.1002
  5  100   0.4991   0.6799   0.4650   0.8667  0.0010  0.0336     0.0136  0.1371
 10    5   0.4731   0.5505   0.0385   0.3753  0.0122  0.0158    52.4500  0.0747
 10   10   0.4861   0.5660   0.3249   0.4518  0.0049  0.0107     0.0489  0.0437
 10   25   0.4937   0.5717   0.3977   0.5763  0.0021  0.0074     0.0211  0.0294
 10  100   0.4993   0.5736   0.4625   0.7223  0.0005  0.0060     0.0058  0.0550
 25    5   0.4871   0.5128       **       **  0.0048  0.0055         **      **
 25   10   0.4930   0.5151       **       **  0.0025  0.0025         **      **
 25   25   0.4966   0.5180       **       **  0.0010  0.0013         **      **
 25  100   0.4997   0.5184       **       **  0.0002  0.0006         **      **
100    5   0.4941   0.5014       **       **  0.0014  0.0013         **      **
100   10   0.4978   0.5018       **       **  0.0007  0.0007         **      **
100   25   0.4990   0.5001       **       **  0.0003  0.0003         **      **
100  100   0.4997   0.5015       **       **  0.0001  0.0001         **      **

(*) The estimator is not available for T = 2.
(**) Computational cost is prohibitive for large T.

Table 2
Performance of estimators for the autoregressive parameter ρ (nonconvergent effects, normal errors, and ρ = 0.50)

          Mean                                MSE
  T    N     MILE    BCOLS       AB       AS    MILE   BCOLS         AB      AS
  2    5   0.4770   1.0835        *        *  0.0818  0.5044          *       *
  2   10   0.4911   1.1389        *        *  0.0196  0.4442          *       *
  2   25   0.4989   1.1994        *        *  0.0037  0.4959          *       *
  2  100   0.5000   1.2352        *        *  0.0002  0.5410          *       *
  3    5   0.4773   0.8349   0.2500   0.9455  0.0346  0.1603   384.7828  0.3733
  3   10   0.4908   0.9110   0.5705   0.9203  0.0087  0.1818     0.5864  0.2215
  3   25   0.4981   0.9636   0.5160   0.8997  0.0013  0.2173     0.0173  0.1719
  3  100   0.4992   0.9904   0.5013   0.8231  0.0001  0.2406     0.0009  0.1049
  5    5   0.4727   0.6997   0.2452   0.7159  0.0165  0.0603     0.1766  0.0873
  5   10   0.4918   0.7415   0.4475   0.7635  0.0043  0.0640     0.0339  0.0795
  5   25   0.4991   0.7755   0.4912   0.7902  0.0007  0.0768     0.0046  0.0861
  5  100   0.4997   0.7936   0.4988   0.7854  0.0000  0.0863     0.0002  0.0816
 10    5   0.4789   0.5798  -0.9436   0.4278  0.0080  0.0151  1721.7952  0.0516
 10   10   0.4908   0.6104   0.4005   0.5980  0.0024  0.0148     0.0197  0.0281
 10   25   0.5027   0.6326   0.4806   0.7370  0.0014  0.0180     0.0022  0.0583
 10  100   0.5000   0.6452   0.4988   0.7765  0.0000  0.0211     0.0001  0.0765
 25    5   0.4884   0.5157       **       **  0.0040  0.0042         **      **
 25   10   0.4949   0.5330       **       **  0.0014  0.0027         **      **
 25   25   0.4995   0.5464       **       **  0.0003  0.0024         **      **
 25  100   0.4999   0.5562       **       **  0.0000  0.0032         **      **
100    5   0.4964   0.4994       **       **  0.0013  0.0014         **      **
100   10   0.4987   0.5038       **       **  0.0006  0.0005         **      **
100   25   0.4994   0.5076       **       **  0.0002  0.0002         **      **
100  100   0.5001   0.5119       **       **  0.0000  0.0002         **      **

(*) The estimator is not available for T = 2.
(**) Computational cost is prohibitive for large T.

Table 2 reports results for nonconvergent effects, normal errors and $\rho^* = 0.5$. Table 3 presents results for random effects, $u_{it} \sim (\chi^2(1) - 1)/\sqrt{2}$ (nonnormal errors) and $\rho^* = 0.5$. In both cases, MILE continues to have smaller bias and MSE than the other estimators. This result is surprising with nonnormal errors as the AB and AS estimators could potentially dominate MILE when N is large and T is small.
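The nonnormal design of Table 3 uses a centered and scaled chi-square(1) error, which has mean zero and unit variance but is strongly skewed. A minimal sketch of such a draw follows; the function name `chi2_error` is illustrative, and the sample size and seed are arbitrary choices.

```python
import math
import random
from statistics import fmean


def chi2_error(rng):
    """Draw (chi2(1) - 1)/sqrt(2): the square of a standard normal, standardized."""
    z = rng.gauss(0.0, 1.0)
    return (z * z - 1.0) / math.sqrt(2.0)


rng = random.Random(1)
draws = [chi2_error(rng) for _ in range(20000)]
sample_mean = fmean(draws)
sample_var = fmean([(d - sample_mean) ** 2 for d in draws])
```

The smallest value such an error can take is $-1/\sqrt{2}$, which is what makes the distribution asymmetric.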

Table 3
Performance of estimators for the autoregressive parameter ρ (random effects, nonnormal errors, and ρ = 0.50)

          Mean                                MSE
  T    N     MILE    BCOLS       AB       AS    MILE   BCOLS         AB      AS
  2    5   0.4520   0.9797        *        *  0.1430  0.5085          *       *
  2   10   0.5024   0.9975        *        *  0.0869  0.3687          *       *
  2   25   0.4993   0.9665        *        *  0.0414  0.2711          *       *
  2  100   0.5042   0.9507        *        *  0.0105  0.2175          *       *
  3    5   0.4666   0.7910   0.3562   0.8923  0.0687  0.1811    31.5729  0.4008
  3   10   0.4803   0.8056   0.4189   0.9204  0.0343  0.1373    59.3092  0.2723
  3   25   0.4951   0.8054   0.3363   0.9376  0.0143  0.1104    53.3848  0.2233
  3  100   0.4992   0.8091   0.5244   0.9683  0.0030  0.0999     0.0839  0.2278
  5    5   0.4712   0.6629   0.2628   0.6585  0.0268  0.0647     0.1905  0.1359
  5   10   0.4821   0.6704   0.3211   0.6975  0.0150  0.0456     0.1282  0.0872
  5   25   0.4928   0.6778   0.3899   0.7748  0.0045  0.0380     0.0810  0.0914
  5  100   0.4967   0.6798   0.4717   0.8539  0.0011  0.0339     0.0128  0.1291
 10    5   0.4722   0.5602   0.0781   0.3906  0.0110  0.0175   162.8453  0.0840
 10   10   0.4893   0.5663   0.3471   0.4507  0.0047  0.0105     0.0405  0.0516
 10   25   0.4946   0.5721   0.4084   0.5625  0.0020  0.0077     0.0178  0.0309
 10  100   0.4984   0.5745   0.4740   0.7154  0.0005  0.0061     0.0035  0.0514
 25    5   0.4819   0.5113       **       **  0.0052  0.0046         **      **
 25   10   0.4890   0.5157       **       **  0.0024  0.0026         **      **
 25   25   0.4974   0.5182       **       **  0.0010  0.0014         **      **
 25  100   0.4990   0.5187       **       **  0.0003  0.0006         **      **
100    5   0.4949   0.4997       **       **  0.0015  0.0014         **      **
100   10   0.4972   0.5004       **       **  0.0007  0.0007         **      **
100   25   0.5000   0.5015       **       **  0.0003  0.0003         **      **
100  100   0.5000   0.5016       **       **  0.0001  0.0001         **      **

(*) The estimator is not available for T = 2.
(**) Computational cost is prohibitive for large T.

Tables 4 and 5 differ from Table 1 only in the autoregressive parameter; respectively, $\rho^* = -0.5$ (negative autocorrelation) and $\rho^* = 1.0$ (integrated model). Most, but not all, conclusions drawn from Table 1 hold here. MILE continues to outperform the AB and AS estimators in terms of mean and MSE. If $\rho^* = -0.5$, MILE and BCOLS seem to perform similarly. If $\rho^* = 1.0$, MILE again performs better than BCOLS for small values of T.

7. Conclusion. A standard method to estimate parameters is the max- 
imum likelihood estimator (MLE). In the presence of nuisance parameters, 
this approach concentrates out the likelihood by replacing these parame- 
ters with maximum likelihood estimators. An alternative approach entails 
maximizing a likelihood that depends only on parameters of interest. This 
marginal likelihood approach (e.g., [18] and [20]) yields an estimator for the 
structural parameter that is often less biased and more accurate than MLE 
(e.g., [11] and [24]). 

Table 4
Performance of estimators for the autoregressive parameter ρ (random effects, normal errors, and ρ = -0.50)

          Mean                                MSE
  T    N     MILE    BCOLS       AB       AS    MILE   BCOLS         AB      AS
  2    5  -0.5489  -0.5689        *        *  0.1706  0.2478          *       *
  2   10  -0.5206  -0.5622        *        *  0.0694  0.1020          *       *
  2   25  -0.5024  -0.5485        *        *  0.0269  0.0374          *       *
  2  100  -0.5047  -0.5476        *        *  0.0058  0.0104          *       *
  3    5  -0.4920  -0.4907  -0.0209  -0.3722  0.0801  0.0791    20.5152  0.3044
  3   10  -0.5006  -0.4994  -0.4555  -0.4485  0.0326  0.0352     4.0370  0.1651
  3   25  -0.5024  -0.5087  -0.4951  -0.4990  0.0117  0.0146     0.0409  0.0578
  3  100  -0.5020  -0.5063  -0.4948  -0.5368  0.0031  0.0033     0.0080  0.0129
  5    5  -0.4878  -0.4728  -0.5408  -0.3755  0.0339  0.0371     0.0549  0.1201
  5   10  -0.4971  -0.4871  -0.5262  -0.4113  0.0156  0.0202     0.0326  0.0713
  5   25  -0.5000  -0.5007  -0.5153  -0.4608  0.0069  0.0073     0.0136  0.0310
  5  100  -0.4992  -0.5021  -0.5030  -0.4860  0.0017  0.0017     0.0033  0.0069
 10    5  -0.4947  -0.4779   0.6536  -0.4602  0.0157  0.0181  3313.3070  0.0343
 10   10  -0.4965  -0.4944  -0.5334  -0.4563  0.0083  0.0078     0.0098  0.0211
 10   25  -0.4987  -0.4951  -0.5144  -0.4541  0.0031  0.0032     0.0046  0.0122
 10  100  -0.4995  -0.4984  -0.5024  -0.4552  0.0008  0.0008     0.0014  0.0041
 25    5  -0.4958  -0.4921       **       **  0.0061  0.0066         **      **
 25   10  -0.4986  -0.4952       **       **  0.0033  0.0030         **      **
 25   25  -0.4988  -0.4994       **       **  0.0013  0.0012         **      **
 25  100  -0.4996  -0.4998       **       **  0.0003  0.0003         **      **
100    5  -0.4996  -0.4986       **       **  0.0016  0.0015         **      **
100   10  -0.5002  -0.4992       **       **  0.0008  0.0008         **      **
100   25  -0.4997  -0.4999       **       **  0.0003  0.0003         **      **
100  100  -0.5000  -0.4993       **       **  0.0001  0.0001         **      **

(*) The estimator is not available for T = 2.
(**) Computational cost is prohibitive for large T.

If the number of nuisance parameters increases, MLE may not even be consistent. This paper proposes a marginal likelihood approach to solve the incidental parameter problem. The use of invariance suggests which marginal likelihoods are to be maximized. We do not necessarily seek complete elimination of the incidental parameters. The goal is to find a group of transformations that preserves the structural parameters and yields a reduction in the incidental parameter space to a finite dimension.

Table 5
Performance of estimators for the autoregressive parameter ρ (random effects, normal errors, and ρ = 1.00)

          Mean                                MSE
  T    N     MILE    BCOLS       AB       AS    MILE   BCOLS         AB      AS
  2    5   0.9307   1.6990        *        *  0.1316  0.7595          *       *
  2   10   0.9766   1.7115        *        *  0.0679  0.6034          *       *
  2   25   1.0009   1.6943        *        *  0.0274  0.5166          *       *
  2  100   0.9958   1.7047        *        *  0.0057  0.5048          *       *
  3    5   0.9674   1.5029   1.0935   1.3267  0.0452  0.3211    36.9311  0.1953
  3   10   1.0072   1.5032   1.0299   1.3320  0.0224  0.2776     5.5735  0.1386
  3   25   0.9971   1.5156   1.0120   1.3469  0.0059  0.2733     0.0313  0.1318
  3  100   0.9975   1.5216   0.9996   1.3624  0.0015  0.2740     0.0068  0.1345
  5    5   0.9827   1.3241   0.9478   1.1497  0.0093  0.1190     0.0313  0.0363
  5   10   0.9949   1.3341   0.9838   1.1531  0.0032  0.1165     0.0089  0.0289
  5   25   0.9984   1.3403   0.9919   1.1659  0.0012  0.1174     0.0030  0.0294
  5  100   0.9999   1.3442   0.9986   1.1760  0.0003  0.1189     0.0007  0.0315
 10    5   0.9960   1.1774   1.2028   1.0534  0.0015  0.0330    55.2326  0.0065
 10   10   0.9989   1.1838   0.9892   1.0621  0.0004  0.0343     0.0007  0.0053
 10   25   0.9992   1.1839   0.9960   1.0680  0.0001  0.0340     0.0002  0.0051
 10  100   1.0000   1.1854   0.9991   1.0687  0.0000  0.0344     0.0001  0.0048
 25    5   0.9994   1.0765       **       **  0.0001  0.0059         **      **
 25   10   1.0000   1.0767       **       **  0.0000  0.0059         **      **
 25   25   0.9998   1.0776       **       **  0.0000  0.0060         **      **
 25  100   1.0000   1.0776       **       **  0.0000  0.0060         **      **
100    5   1.0000   1.0197       **       **  0.0000  0.0004         **      **
100   10   0.9999   1.0198       **       **  0.0000  0.0004         **      **
100   25   1.0000   1.0198       **       **  0.0000  0.0004         **      **
100  100   1.0000   1.0198       **       **  0.0000  0.0004         **      **

(*) The estimator is not available for T = 2.
(**) Computational cost is prohibitive for large T.

We illustrate this approach with four examples: a stationary autoregressive model with fixed effects; a monotonic transformation model; an instrumental variable (IV) model; and a dynamic panel data model. In the first two examples, the invariant likelihoods are the products of marginal likelihoods and do not depend on the incidental parameters at all. In the last two examples, the invariant likelihoods are Wishart and depend on the incidental parameters through one-dimensional noncentrality parameters.

For most groups of transformations, it is not possible to discard the in- 
cidental parameters completely. Because we allow invariant likelihoods to 
depend on incidental parameters, we have two considerations to make. First, 
finite-sample improvements may be possible using the orthogonalization ap- 
proach of [15] to the invariant likelihood (e.g., [23]). Second, we treat the 
incidental parameters as an arbitrary sequence of numbers. Other authors 
(e.g., [21]) instead consider the incidental parameters as independently and 
identically distributed chance variables with distribution function. It would 
be interesting to understand the costs and benefits of treating the incidental 
parameters as unknown constants or chance variables. 
APPENDIX OF PROOFS



Proofs of results stated in Section 3. 

Proof of Lemma 3.1. Part (a) follows from Theorem 5.7 of [37]. Part 
(b) follows from Theorem 3.1 of [33]. Part (c) follows from Theorem 12.2.3 
of [27] and Lemma 8.14 of [37]. □ 

Proof of Proposition 3.1. For part (a), we need to show that $M(y_{i\cdot}) = M(\tilde y_{i\cdot})$ if and only if $\tilde y_{i\cdot} = y_{i\cdot} + g \cdot 1_T$ for some g. Clearly, $M(y_{i\cdot})$ is an invariant statistic,

$M(y_{i\cdot} + g \cdot 1_T) = D(y_{i\cdot} + g \cdot 1_T) = Dy_{i\cdot} + g \cdot D1_T = Dy_{i\cdot} = M(y_{i\cdot})$.

Now, suppose that $M(y_{i\cdot}) = M(\tilde y_{i\cdot})$. This implies that $Dz_i = 0$ for $z_i = \tilde y_{i\cdot} - y_{i\cdot}$, which means that $z_i$ belongs to the space orthogonal to the row space of D. Because rank(D) = T − 1, the orthogonal space has dimension one. As this space contains the vector $1_T$, it must be the case that $z_i = g \cdot 1_T$ for some scalar g. Therefore, $\tilde y_{i\cdot} = y_{i\cdot} + g \cdot 1_T$.

Part (b) follows from the fact that the group of transformations acts transitively on $\eta_i$. Part (c) follows from the formula of the density of a normal distribution. □

Proof of Proposition 3.2. For part (a), let $M_{it}$ be the rank of $y_{it}$ in the collection $y_{i1}, \ldots, y_{iT}$. Formally, we can define $M_{it}$ through $y_{it} = y_{i(M_{it})}$. We shall abbreviate the notation, for example, $(g(y_{i1}), g(y_{i2}), \ldots, g(y_{iT}))$ as $g(y_{i\cdot})$. The maximal invariant is $M_i = (M_{i1}, \ldots, M_{iT}) = M(y_{i\cdot})$. We need to show that $M(y_{i\cdot}) = M(\tilde y_{i\cdot})$ if and only if $\tilde y_{i\cdot} = g(y_{i\cdot})$. Consider the case in which, if $t \ne s$, then $y_{it} \ne y_{is}$ (this set has probability measure equal to one). Clearly, $M_i$ is an invariant statistic. Now, suppose that $M(y_{i\cdot}) = M(\tilde y_{i\cdot})$. This implies that $M_{i1} = \tilde M_{i1}, \ldots, M_{iT} = \tilde M_{iT}$. Therefore, $y_{ij_1} < \cdots < y_{ij_T}$ and $\tilde y_{ij_1} < \cdots < \tilde y_{ij_T}$. There is a continuous, strictly increasing transformation g such that $\tilde y_{it} = g(y_{it})$, $t = 1, \ldots, T$.

Part (b) follows from the fact that the group of transformations acts transitively on $\eta_i$.

For part (c), we note that because $\eta_i$ is an increasing transformation, $M_{it}$ is also the rank in the collection $y_{i1}^*, \ldots, y_{iT}^*$, where $y_{it}^* = x_{it}'\beta + u_{it}$. We note that $y_{i1}^*, \ldots, y_{iT}^*$ are jointly independent with marginal densities

$f_{it}(z_{it};\beta) = (2\pi)^{-1/2}\exp\{-\tfrac12 (z_{it} - x_{it}'\beta)^2\}$.

Now, we note that

$P(M_{i1} = m_{i1}, \ldots, M_{iT} = m_{iT}) = \int \prod_{t=1}^T f_{it}(z_{it};\beta)\,dz$,

integrated over the set in which $z_{it}$ is the $m_{it}$th smallest element of $z_{i1}, \ldots, z_{iT}$. We follow [27] and transform $w_{m_{it}} = z_{it}$ to obtain

$P(M_{i1} = m_{i1}, \ldots, M_{iT} = m_{iT}) = \int_A \prod_{t=1}^T f_{it}(w_{m_{it}};\beta)\,dw = \int_A \prod_{t=1}^T \frac{f_{it}(w_{m_{it}};\beta)}{f(w_{m_{it}})} \prod_{t=1}^T f(w_t)\,dw$,

where $f(w_t)$ is the density of a N(0, 1) distribution and $A = \{w \in \mathbb{R}^T; w_1 < \cdots < w_T\}$. Simple algebraic manipulations show that

$P(M_i = m_i) = \int_A \exp\{-\tfrac12\sum_{t=1}^T (w_{m_{it}} - x_{it}'\beta)^2 + \tfrac12\sum_{t=1}^T w_t^2\}\prod_{t=1}^T f(w_t)\,dw$

$= \int_A \exp\{\sum_{t=1}^T w_{m_{it}}x_{it}'\beta - \tfrac12\sum_{t=1}^T (x_{it}'\beta)^2\}\prod_{t=1}^T f(w_t)\,dw$

$= \frac{1}{T!}\int_A \exp\{\sum_{t=1}^T w_{m_{it}}x_{it}'\beta - \tfrac12\beta'\sum_{t=1}^T x_{it}x_{it}'\beta\}\,T!\prod_{t=1}^T f(w_t)\,dw$,

where $T!\prod_{t=1}^T f(w_t)$ for $w_1 < \cdots < w_T$ is the p.d.f. of $V_{(1)}, \ldots, V_{(T)}$. □
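The two invariance arguments above can be checked numerically. The sketch below assumes that D is the first-difference matrix, which is one matrix of rank T − 1 satisfying $D1_T = 0$ (the definition of D is given in Section 3, so this choice is an assumption), and computes ranks by sorting. Function names are illustrative.

```python
import math


def first_differences(y):
    """Apply an assumed first-difference matrix D to y, so that D @ 1_T = 0."""
    return [later - earlier for earlier, later in zip(y, y[1:])]


def ranks(y):
    """Rank of each element of y (1 = smallest); ties are assumed absent."""
    order = sorted(range(len(y)), key=lambda t: y[t])
    result = [0] * len(y)
    for position, t in enumerate(order, start=1):
        result[t] = position
    return result


y = [1, 4, 2]
shifted = [value + 7 for value in y]            # y + g * 1_T with g = 7
transformed = [math.exp(value) for value in y]  # strictly increasing g(.)
```

Shifting every observation by the same constant leaves `first_differences` unchanged, and applying a strictly increasing map leaves `ranks` unchanged, which mirrors the first half of each proof.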

Proofs of results stated in Section 4. For convenience, we omit the subscript in $\lambda_N$.

Proof of Proposition 4.1. For part (a), we need to show that $M(R_1,R_2) = M(\tilde R_1,\tilde R_2)$ if and only if $(\tilde R_1,\tilde R_2) = (gR_1,R_2)$ for some $g \in O(K)$. Clearly, $M(R_1,R_2)$ is an invariant statistic,

$M(gR_1,R_2) = (R_1'g'gR_1,R_2) = (R_1'R_1,R_2) = M(R_1,R_2)$.

Now, suppose that $M(R_1,R_2) = M(\tilde R_1,\tilde R_2)$. This is equivalent to $\tilde R_1'\tilde R_1 = R_1'R_1$ and $\tilde R_2 = R_2$. But this implies that $\tilde R_1 = gR_1$ for some $g \in O(K)$ (and, of course, $\tilde R_2 = R_2$).

Part (b) follows analogously. □
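The invariance step $R_1'g'gR_1 = R_1'R_1$ can be illustrated with a small numerical check. The sketch below takes K = 2 and uses a planar rotation as the orthogonal matrix g; the matrix entries and the angle are arbitrary illustrative values.

```python
import math


def transpose(m):
    return [list(column) for column in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*b)] for row in a]


def gram(r):
    """Gram matrix R'R, the first component of the maximal invariant."""
    return matmul(transpose(r), r)


def rotation(angle):
    """A 2 x 2 orthogonal matrix (an element of O(2))."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]


R1 = [[1.0, 2.0], [3.0, -1.0]]
rotated = matmul(rotation(0.7), R1)
max_gap = max(
    abs(x - y)
    for row_a, row_b in zip(gram(rotated), gram(R1))
    for x, y in zip(row_a, row_b)
)
```

Up to floating-point error, `max_gap` is zero: rotating the rows of $R_1$ does not change $R_1'R_1$.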

Proof of Theorem 4.1. Following [5], the density function of $Y'N_ZY$ at q is

/(g)=Ci,i,-expf-— a'S-ia')|Sr^/2|g|(^-3)/2expf-itr(S-ig) 
The density function of $W_N$ is then

g{w;f3,XN)=f{qiw)) ■ \q' iw)\ = f{qH)N^-^/^ 
which simplifies to (4.1). □

Proof of Theorem 4.2. The log-likelihood function divided by N is
(A.l) - ^l^l^l + ^^^\Wn\ -\triJ:-'WM) 

+ lln(2(^-2)/2^(i^+2)/2^^^^)^ 

where $Z_N = 2\sqrt{\lambda \cdot a'\Sigma^{-1}W_N\Sigma^{-1}a}$.

All terms in the last two lines converge under both SIV and MWIV asymptotics (the only exception is $\ln|W_N|$ under SIV asymptotics and under MWIV asymptotics with $\alpha = 0$). For example, the last term is

1 i^(2(^-2)/2^(/r+2)/2^ i,) = - Inf -4^^^:^^) + 0(1) 
N ^ ' ' N \r{{K-l)/2)J ^ ' 

under both SIV and MWIV asymptotics. Under SIV asymptotics, 

N'\mK-l)/2))-'- 
Under MWIV asymptotics, we can use Stirling's formula to obtain 

1 . / N^K+2)/2 ^ 



N'\vi^K-l)/2))--2V-\2 

However, the second and third lines in (A.1) do not depend on $\theta$. As a result, these terms can be ignored in finding the limiting behavior of $\hat\theta_N$. Hence, define the objective function

$Q_N(\theta) = -\frac{\lambda}{2} \cdot a'\Sigma^{-1}a + \frac{1}{N}\ln\Big(Z_N^{-(K-2)/2}\, I_{(K-2)/2}\Big(\frac{N}{2}Z_N\Big)\Big)$.

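The quantity $Z_N$ depends on $W_N$. Following [32], Section 10.2,

$E(W_N) = \frac{K \cdot \Sigma + M'M}{N} = \frac{K \cdot \Sigma + \pi'Z'Z\pi \cdot a^*a^{*\prime}}{N} = \frac{K}{N} \cdot \Sigma + \lambda_N^* \cdot a^*a^{*\prime}$.

From here, we split the result into SIV or MWIV with $\alpha = 0$ asymptotics, and MWIV with $\alpha > 0$.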

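For part (a), $W_N = W_N^* + o_p(1)$, where

$W_N^* = \lambda_N^* \cdot a^*a^{*\prime}$.

Hence, $Z_N = Z_N^* + o_p(1)$, where

$Z_N^* = 2\sqrt{\lambda \cdot \lambda_N^*\,(a'\Sigma^{-1}a^*)^2}$.

The same holds for nonnormal errors, as long as $V(W_N) \to 0$.

Because K is fixed and $N \to \infty$, $Q_N(\theta) = Q_N^*(\theta) + o_p(1)$ (uniformly in $\theta \in \Theta$ compact), where

$Q_N^*(\theta) = -\frac12\lambda \cdot a'\Sigma^{-1}a + \lambda^{1/2}\lambda_N^{*1/2}\,a^{*\prime}\Sigma^{-1}a$.

The first-order condition (FOC) for $Q_N^*(\theta)$ is given by

$\frac{\partial Q_N^*(\theta)}{\partial\beta} = -\lambda \cdot a'\Sigma^{-1}e_1 + \lambda^{1/2}\lambda_N^{*1/2}\,a^{*\prime}\Sigma^{-1}e_1$,

$\frac{\partial Q_N^*(\theta)}{\partial\lambda} = -\frac12 a'\Sigma^{-1}a + \frac12\lambda^{-1/2}\lambda_N^{*1/2}\,a^{*\prime}\Sigma^{-1}a$.

The value $\theta_N^* = (\beta^*, \lambda_N^*)$ maximizes $Q_N^*(\theta)$, setting the FOC to zero.
For parts (a)(i), (ii), $Q_N(\theta) \to_p Q(\theta)$, where

$Q(\theta) = -\frac12\lambda \cdot a'\Sigma^{-1}a + \lambda^{1/2}\lambda^{*1/2}\,a^{*\prime}\Sigma^{-1}a$.

Since $\theta \in \Theta$ compact and $Q(\theta)$ is continuous, $\hat\theta_N \to_p \theta^*$.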

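For part (a)(iii), we can define $\tau(\theta,\theta_N^*) = Q_N^*(\theta)$, which is continuous. For each point $\theta_N^*$, the function $\tau(\theta,\theta_N^*)$ reaches the maximum at $\theta = \theta_N^*$. Because $\Theta$ is compact and $\tau(\cdot,\theta_N^*)$ is continuous,

$\sup_{\theta\in\Theta;\,\|\theta-\theta_N^*\|\ge\varepsilon} Q_N^*(\theta) - Q_N^*(\theta_N^*) = \max_{\theta\in\Theta;\,\|\theta-\theta_N^*\|\ge\varepsilon} Q_N^*(\theta) - Q_N^*(\theta_N^*) \equiv \delta(\theta_N^*) < 0$.

Because $0 < \liminf \lambda_N^*$ and $\limsup \lambda_N^* < \infty$, there exists a compact set $\Theta^* \subseteq \Theta$ in which $\theta_N^* \in \Theta^*$ eventually. Using continuity of $\delta(\cdot)$,

$\sup_N \delta(\theta_N^*) \le \max_{\theta^*\in\Theta^*}\delta(\theta^*) = \bar\delta < 0$

for large enough N. This implies $\theta_N^*$ is an identifiably unique sequence of maximizers of $Q_N^*(\theta)$,

$\limsup_{N\to\infty}\ \sup_{\theta\in\Theta;\,\|\theta-\theta_N^*\|\ge\varepsilon} Q_N^*(\theta) - Q_N^*(\theta_N^*) < 0$.

The result now follows from [36], Lemma 3.1.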

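For part (b), $W_N = W_N^* + o_p(1)$ under SIV and MWIV asymptotics, where

$W_N^* = \alpha\Sigma + \lambda_N^* \cdot a^*a^{*\prime}$.

Hence, $Z_N = Z_N^* + o_p(1)$, where $Z_N^*$ is defined as

$Z_N^* = 2\sqrt{\lambda \cdot a'\Sigma^{-1}(\alpha\Sigma + \lambda_N^* \cdot a^*a^{*\prime})\Sigma^{-1}a}$.

The same holds for nonnormal errors, as long as $V(W_N) \to 0$. For $K/N \to \alpha > 0$, we use [1] to show that $Q_N(\theta) = Q_N^*(\theta) + o_p(1)$ (uniformly in $\theta \in \Theta$ compact), where

$Q_N^*(\theta) = -\frac12\lambda \cdot a'\Sigma^{-1}a + \frac{\alpha}{2}\Big(1 + \frac{Z_N^{*2}}{\alpha^2}\Big)^{1/2} - \frac{\alpha}{2}\ln\Big(1 + \Big(1 + \frac{Z_N^{*2}}{\alpha^2}\Big)^{1/2}\Big)$.

The first-order condition (FOC) for $Q_N^*(\theta)$ is given by

$\frac{\partial Q_N^*(\theta)}{\partial\beta} = -\lambda \cdot a'\Sigma^{-1}e_1 + \frac{2\lambda}{\alpha}\,\frac{\alpha \cdot a'\Sigma^{-1}e_1 + \lambda_N^* \cdot a^{*\prime}\Sigma^{-1}a \cdot a^{*\prime}\Sigma^{-1}e_1}{1 + (1 + Z_N^{*2}/\alpha^2)^{1/2}}$,

$\frac{\partial Q_N^*(\theta)}{\partial\lambda} = -\frac12 a'\Sigma^{-1}a + \frac{1}{\alpha}\,\frac{\alpha \cdot a'\Sigma^{-1}a + \lambda_N^* \cdot (a^{*\prime}\Sigma^{-1}a)^2}{1 + (1 + Z_N^{*2}/\alpha^2)^{1/2}}$.

The value $\theta_N^* = (\beta^*, \lambda_N^*)$ maximizes $Q_N^*(\theta)$, setting the FOC to zero.
For parts (b)(i), (ii), $Q_N(\theta) \to_p Q(\theta)$ given by

$Q(\theta) = -\frac12\lambda \cdot a'\Sigma^{-1}a + \frac{\alpha}{2}\Big(1 + \frac{Z^{*2}}{\alpha^2}\Big)^{1/2} - \frac{\alpha}{2}\ln\Big(1 + \Big(1 + \frac{Z^{*2}}{\alpha^2}\Big)^{1/2}\Big)$,

where $Z^* = 2\sqrt{\lambda \cdot a'\Sigma^{-1}(\alpha\Sigma + \lambda^* \cdot a^*a^{*\prime})\Sigma^{-1}a}$. Since $\theta \in \Theta$ compact and $Q(\theta)$ is continuous, $\hat\theta_N \to_p \theta^*$.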

Part (b)(iii) follows analogously to part (a)(iii). □ 

Proof of Proposition 4.2. It follows from [12] that the integrated likelihood [over Haar measures for O(K)] is maximized over a by

$\max_a \frac{a'\Sigma^{-1/2}Y'N_ZY\Sigma^{-1/2}a}{a'a}$.

This optimal a is the eigenvector corresponding to the largest eigenvalue of $\Sigma^{-1/2}Y'N_ZY\Sigma^{-1/2}$. The integrated likelihood coincides with the likelihood of the maximal invariant and a is a transformation of $\beta$. As a result, MILE is equivalent to LIMLK. □
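The eigenvector characterization rests on a standard fact: the ratio $a'Aa/a'a$ for a symmetric matrix A is maximized by an eigenvector of the largest eigenvalue. The sketch below checks this for a generic symmetric 2 x 2 matrix with entries (a, b; b, c), using the closed-form eigenvalue. The numerical entries are arbitrary and merely stand in for $\Sigma^{-1/2}Y'N_ZY\Sigma^{-1/2}$; the function names are illustrative.

```python
import math


def top_eigenpair(a, b, c):
    """Largest eigenvalue and one eigenvector of the symmetric matrix [[a, b], [b, c]]."""
    lam = (a + c) / 2.0 + math.hypot((a - c) / 2.0, b)
    if b != 0.0:
        vector = (b, lam - a)
    else:
        vector = (1.0, 0.0) if a >= c else (0.0, 1.0)
    return lam, vector


def rayleigh(a, b, c, v):
    """Rayleigh quotient v'Av / v'v for A = [[a, b], [b, c]]."""
    x, y = v
    return (a * x * x + 2.0 * b * x * y + c * y * y) / (x * x + y * y)


lam, vector = top_eigenpair(2.0, 1.0, 3.0)
# Evaluate the quotient on a grid of directions; none should exceed lam.
grid_max = max(
    rayleigh(2.0, 1.0, 3.0, (math.cos(math.radians(k)), math.sin(math.radians(k))))
    for k in range(360)
)
```

The quotient evaluated at `vector` equals `lam`, and no direction on the grid does better.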

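Proof of Theorem 4.3. For part (a), when K is fixed or $K/N \to 0$,

(A.2)  $Q_N(\theta) = -\frac12\lambda \cdot a'\Sigma^{-1}a + \lambda^{1/2}(a'\Sigma^{-1}W_N\Sigma^{-1}a)^{1/2} + O_p(N^{-1})$.

All results below hold up to $o_p(N^{-1/2})$ order.

The components of the score function $S_N(\theta)$ are

$\frac{\partial Q_N(\theta)}{\partial\beta} = -\lambda \cdot a'\Sigma^{-1}e_1 + \lambda^{1/2}\frac{a'\Sigma^{-1}W_N\Sigma^{-1}e_1}{(a'\Sigma^{-1}W_N\Sigma^{-1}a)^{1/2}}$,

$\frac{\partial Q_N(\theta)}{\partial\lambda} = -\frac{a'\Sigma^{-1}a}{2} + \frac{(a'\Sigma^{-1}W_N\Sigma^{-1}a)^{1/2}}{2\lambda^{1/2}}$.

The components of the Hessian matrix $H_N(\theta) = H(W_N;\theta)$ are

$\frac{\partial^2 Q_N(\theta)}{\partial\beta^2} = -\lambda \cdot e_1'\Sigma^{-1}e_1 + \lambda^{1/2}\frac{e_1'\Sigma^{-1}W_N\Sigma^{-1}e_1}{(a'\Sigma^{-1}W_N\Sigma^{-1}a)^{1/2}} - \lambda^{1/2}\frac{(a'\Sigma^{-1}W_N\Sigma^{-1}e_1)^2}{(a'\Sigma^{-1}W_N\Sigma^{-1}a)^{3/2}}$,

$\frac{\partial^2 Q_N(\theta)}{\partial\beta\,\partial\lambda} = -a'\Sigma^{-1}e_1 + \frac{a'\Sigma^{-1}W_N\Sigma^{-1}e_1}{2\lambda^{1/2}(a'\Sigma^{-1}W_N\Sigma^{-1}a)^{1/2}}$,

$\frac{\partial^2 Q_N(\theta)}{\partial\lambda^2} = -\frac{(a'\Sigma^{-1}W_N\Sigma^{-1}a)^{1/2}}{4\lambda^{3/2}}$.

Because $W_N \to_p W^*$, $H_N(\theta^*) \to_p -\mathcal{I}_0(\theta^*)$. Furthermore, $H_N(\theta) \to_p H(W^*;\theta)$ uniformly on $\theta = (\beta, \lambda)$ for a compact set containing $\theta^*$, as long as $\lambda > 0$. This completes part (a)(ii). To show part (a)(i), we write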



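$\sqrt{N}S_N(\theta^*) = \sqrt{N}S(W_N;\theta^*) = \sqrt{N}[S(W_N;\theta^*) - S(W^*;\theta^*)]$.

Using $\mathrm{vec}(W_N) = \mathcal{D}_2\,\mathrm{vech}(W_N)$, where $\mathcal{D}_2$ is the duplication matrix (e.g., [30]), we write

$\sqrt{N}S_N(\theta^*) = \sqrt{N}[L(\mathrm{vech}(W_N);\theta^*) - L(\mathrm{vech}(W^*);\theta^*)]$,

where $L: \mathbb{R}^3 \to \mathbb{R}^2$. Now, $\sqrt{N}(\mathrm{vech}(W_N) - \mathrm{vech}(W^*))$ converges to a normal distribution by a standard CLT. As a result, using the delta method and the information identity, $\sqrt{N}S_N(\theta^*)$ converges to a normal distribution with zero mean and variance $\mathcal{I}_0(\theta^*)$. Part (iii) follows from [33].

For part (b), when $K/N \to \alpha > 0$,

(A.3)  $Q_N(\theta) = -\frac12\lambda \cdot a'\Sigma^{-1}a + \frac{\alpha}{2}\Big(1 + \frac{Z_N^2}{\alpha^2}\Big)^{1/2} - \frac{\alpha}{2}\ln\Big(1 + \Big(1 + \frac{Z_N^2}{\alpha^2}\Big)^{1/2}\Big)$

up to an $O_p(N^{-1})$ term. All results below hold up to $o_p(N^{-1/2})$ order.

The components of the score function $S_N(\theta)$ are

$\frac{\partial Q_N(\theta)}{\partial\beta} = -\lambda \cdot a'\Sigma^{-1}e_1 + \frac{2\lambda}{\alpha}\,\frac{a'\Sigma^{-1}W_N\Sigma^{-1}e_1}{1 + (1 + Z_N^2/\alpha^2)^{1/2}}$,

$\frac{\partial Q_N(\theta)}{\partial\lambda} = -\frac{a'\Sigma^{-1}a}{2} + \frac{1}{\alpha}\,\frac{a'\Sigma^{-1}W_N\Sigma^{-1}a}{1 + (1 + Z_N^2/\alpha^2)^{1/2}}$.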
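The components of the Hessian matrix $H_N(\theta)$ are

$\frac{\partial^2 Q_N(\theta)}{\partial\beta^2} = -\lambda \cdot e_1'\Sigma^{-1}e_1 + \frac{2\lambda}{\alpha}\,\frac{e_1'\Sigma^{-1}W_N\Sigma^{-1}e_1}{1 + (1 + Z_N^2/\alpha^2)^{1/2}} - \frac{8\lambda^2}{\alpha^3(1 + Z_N^2/\alpha^2)^{1/2}}\,\frac{(a'\Sigma^{-1}W_N\Sigma^{-1}e_1)^2}{(1 + (1 + Z_N^2/\alpha^2)^{1/2})^2}$,

$\frac{\partial^2 Q_N(\theta)}{\partial\beta\,\partial\lambda} = -a'\Sigma^{-1}e_1 + \frac{2}{\alpha}\,\frac{a'\Sigma^{-1}W_N\Sigma^{-1}e_1}{1 + (1 + Z_N^2/\alpha^2)^{1/2}} - \frac{4\lambda}{\alpha^3(1 + Z_N^2/\alpha^2)^{1/2}}\,\frac{a'\Sigma^{-1}W_N\Sigma^{-1}e_1 \cdot a'\Sigma^{-1}W_N\Sigma^{-1}a}{(1 + (1 + Z_N^2/\alpha^2)^{1/2})^2}$,

$\frac{\partial^2 Q_N(\theta)}{\partial\lambda^2} = \frac{-2}{\alpha^3(1 + Z_N^2/\alpha^2)^{1/2}}\,\frac{(a'\Sigma^{-1}W_N\Sigma^{-1}a)^2}{(1 + (1 + Z_N^2/\alpha^2)^{1/2})^2}$.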

Parts (b)(i)-(iii) follow analogously to parts (a)(i)-(iii). □ 

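Proof of Corollary 4.1. The determinant of $\mathcal{I}_\alpha(\theta^*)$ simplifies to

$|\mathcal{I}_\alpha(\theta^*)| = \frac{\lambda^{*2}(a^{*\prime}\Sigma^{-1}a^*)^2}{\alpha + 2\lambda^* \cdot a^{*\prime}\Sigma^{-1}a^*}\cdot\frac{a^{*\prime}\Sigma^{-1}a^* \cdot e_1'\Sigma^{-1}e_1 - (a^{*\prime}\Sigma^{-1}e_1)^2}{2(\alpha + \lambda^* \cdot a^{*\prime}\Sigma^{-1}a^*)}$.

Hence, the entry (1, 1) of the inverse of $\mathcal{I}_\alpha(\theta^*)$ equals

$(\mathcal{I}_\alpha(\theta^*)^{-1})_{11} = \frac{(a^{*\prime}\Sigma^{-1}a^*)^2}{2(\alpha + 2\lambda^* a^{*\prime}\Sigma^{-1}a^*)}\,|\mathcal{I}_\alpha(\theta^*)|^{-1} = \frac{\alpha + \lambda^* \cdot a^{*\prime}\Sigma^{-1}a^*}{\lambda^{*2}}\cdot\frac{1}{a^{*\prime}\Sigma^{-1}a^* \cdot e_1'\Sigma^{-1}e_1 - (a^{*\prime}\Sigma^{-1}e_1)^2}$.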

This expression coincides with the asymptotic variance of LIMLK as de- 
scribed in (4.7) of [10]: 



□ 



Proof of Theorem 4.4. This result follows from standard limit of experiment arguments (see [14]). Part (a) follows from expansions based on (A.2). Part (b) follows from expansions based on (A.3). □

Proofs of results stated in Section 5. For convenience, we omit the subscript in $\lambda_N$. For the next proofs, define the following four quantities:

$c_1 = \mathrm{tr}(DB^*B^{*\prime}D') + \lambda^* 1_T'B^{*\prime}D'DB^*1_T$,

$c_2 = 1_T'DB^*B^{*\prime}D'1_T + \lambda^*(1_T'DB^*1_T)^2$,

$c_3 = 1_T'F1_T + (\rho^* - \rho)1_T'F'F1_T + \lambda^* 1_T'DB^*1_T \cdot 1_T'F1_T$,

$c_4 = (\rho^* - \rho)\,\mathrm{tr}(F'F) + \lambda^*\{1_T'F1_T + (\rho^* - \rho)1_T'F'F1_T\}$.

Proof of Proposition 5.1. We omit the proof here as it has been generalized by [13]. □

Proof of Theorem 5.1. The density function of M at g is 

""^'"^ rr\ , 2\-NT/2\ \{N-T-l)/2 f ^ 



/(g) = C72,^.exp(-^rj(a2)-^^/2|g|(^-^-^)/2exp(^-^tr(I?gZ)'; 

-{N~2)/2 



"^'"^ I'j^DqD'lT] I^M-2)/2{Jj%lTDqD'lT]. 



The density function of Wn is then 

giw; P, Xn) = fiqiw)) ■ \q'iw)\ = fiqiw))N^^^^''^/\ 
which simplifies to (5.3). □

Proof of Theorem 5.2. The log-likelihood divided by NT is



Q^(^) = --lna -A 

1 , /„-{Ar-2)/2^ Z' N 



(A.4) +_ln(^Z-^^^-^^/^/(^_2)/2( -^^ 



where $Z_N = 2\sqrt{\lambda \cdot 1_T'DW_ND'1_T/\sigma^2}$.

The third line is well-behaved when $N \to \infty$ with T fixed. For example, using Stirling's formula,



j^NT/2~{N~2)/22l/2 



^ vn£'/(A' - t)(^-*-i)/{2^) exp(-(iV - t)/(2A^)) 



+ o(i) 



ln(2) T - 1 
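In addition, $W_N = W_N^* + o_p(1)$, where

$W_N^* = \sigma^{*2}B^*(I_T + \lambda_N^* 1_T1_T')B^{*\prime}$.

Now,

$|W_N^*| = |B^*| \cdot |\sigma^{*2}(I_T + \lambda_N^* 1_T1_T')| \cdot |B^{*\prime}| = (\sigma^{*2})^T\,|I_T + \lambda_N^* 1_T1_T'|$.

As a result, $\ln|W_N| = T\ln(\sigma^{*2}) + \ln(1 + \lambda_N^* T) + o_p(1)$.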
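The last step uses the identity $|I_T + \lambda 1_T1_T'| = 1 + \lambda T$ (a special case of the matrix determinant lemma). The sketch below verifies it numerically with a small Gaussian-elimination determinant; the values of T and $\lambda$ are arbitrary and the helper names are illustrative.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, det = len(a), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
    return det


def identity_plus_ones(T, lam):
    """The T x T matrix I_T + lam * 1_T 1_T'."""
    return [[(1.0 if i == j else 0.0) + lam for j in range(T)] for i in range(T)]


T, lam = 5, 0.7
lhs = determinant(identity_plus_ones(T, lam))
rhs = 1.0 + lam * T
```

Here `lhs` and `rhs` agree up to rounding, which is the source of the $\ln(1 + \lambda_N^* T)$ term.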

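It is unknown whether the third line in (A.4) is well-behaved with $T \to \infty$. However, since it does not depend on $\theta$, it can be ignored when finding the limiting behavior of $\hat\theta_N$. Hence, define the objective function

$Q_N(\theta) = -\frac12\ln\sigma^2 - \frac{1}{2\sigma^2}\frac{\mathrm{tr}(DW_ND')}{T} - \frac{\lambda}{2} + \frac{1}{NT}\ln\Big(Z_N^{-(N-2)/2}\,I_{(N-2)/2}\Big(\frac{N}{2}Z_N\Big)\Big)$.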



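From here, we split the result into fixed T and large T asymptotics.
For part (a), in which $N \to \infty$ with T fixed, $Z_N = Z_N^* + o_p(1)$, where

$Z_N^* = 2\sqrt{\lambda \cdot 1_T'DW_N^*D'1_T/\sigma^2}$.

We use [1] to show that $Q_N(\theta) = Q_N^*(\theta) + o_p(1)$, where

$Q_N^*(\theta) = -\frac12\ln\sigma^2 - \frac{1}{2\sigma^2}\frac{\mathrm{tr}(DW_N^*D')}{T} - \frac{\lambda}{2} + \frac{(1 + Z_N^{*2})^{1/2}}{2T} - \frac{1}{2T}\ln(1 + (1 + Z_N^{*2})^{1/2})$.

The first-order condition (FOC) for $Q_N^*(\theta)$ is given by

$\frac{\partial Q_N^*(\theta)}{\partial\rho} = \frac{\sigma^{*2}}{\sigma^2}\,\frac{(\rho^* - \rho)\,\mathrm{tr}(FF') + \lambda^*\{1_T'F1_T + (\rho^* - \rho)1_T'F'F1_T\}}{T} - \frac{2\sigma^{*2}}{\sigma^2}\,\frac{\lambda}{1 + (1 + Z_N^{*2})^{1/2}}\,\frac{1_T'F1_T + (\rho^* - \rho)1_T'F'F1_T + \lambda^*\{T + (\rho^* - \rho)1_T'F1_T\}1_T'F1_T}{T}$,

$\frac{\partial Q_N^*(\theta)}{\partial\sigma^2} = -\frac{1}{2\sigma^2} + \frac{\sigma^{*2}}{2(\sigma^2)^2}\frac{c_1}{T} - \frac{\sigma^{*2}}{(\sigma^2)^2}\,\frac{\lambda}{1 + (1 + Z_N^{*2})^{1/2}}\,\frac{c_2}{T}$,

$\frac{\partial Q_N^*(\theta)}{\partial\lambda} = -\frac12 + \frac{\sigma^{*2}}{\sigma^2}\,\frac{1}{1 + (1 + Z_N^{*2})^{1/2}}\,\frac{c_2}{T}$.

The value $\theta^* = (\rho^*, \sigma^{*2}, \lambda_N^*)$ maximizes $Q_N^*(\theta)$, setting the FOC to zero.
For parts (a)(i), (ii), $Q_N(\theta) \to_p Q(\theta)$ (uniformly in $\theta \in \Theta$ compact) given by

$Q(\theta) = -\frac12\ln\sigma^2 - \frac{1}{2\sigma^2}\frac{\mathrm{tr}(DW^*D')}{T} - \frac{\lambda}{2} + \frac{(1 + Z^{*2})^{1/2}}{2T} - \frac{1}{2T}\ln(1 + (1 + Z^{*2})^{1/2})$,

where $W^*$ and $Z^*$ are defined as

(A.5)  $W^* = \sigma^{*2}B^*(I_T + \lambda^* 1_T1_T')B^{*\prime}$ and $Z^* = 2\sqrt{\lambda \cdot 1_T'DW^*D'1_T/\sigma^2}$.

Since $\theta \in \Theta$ compact and $Q(\theta)$ is continuous, $\hat\theta_N \to_p \theta^*$.
Part (a)(iii) follows analogously to Theorem 4.2(a)(iii).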
For part (b), the dimension of $W_N$ changes as $T \to \infty$. Yet, for $|\rho^*| < 1$,

triDWND') ^. tT{DW;,D') , 
= lim h o„ 1 

and 

1'tDWnD'It ,. 1'tDW*^D'1t , 

— = iim — h o„ 1 . 

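This approximation does not depend on how N grows with T. We use [1] to obtain $Q_N(\theta) = Q_N^*(\theta) + o_p(1)$, where

$Q_N^*(\theta) = -\frac12\ln\sigma^2 - \frac{1}{2\sigma^2}\lim_{T\to\infty}\frac{\mathrm{tr}(DW_N^*D')}{T} - \frac12\lambda + \frac12\lim_{T\to\infty}\frac{Z_N^*}{T}$.

The first-order condition (FOC) for $Q_N^*(\theta)$ is given by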

dQN{9) _ ^.^ (7*2 {p* - p) tr(FF') + \*{1'tFIt + (p* - p)1'tF'FIt} 



dp T^oo 0-2 T 

(a*2)l/2A*l/2Al/2 1/^71^ 

— hm 

jiToo (a2)i/2 T ' 

dQN{9) 1_ a*^ ci (a*2)i/2Ai/2A*i/2 i'^db*It 

da^ ~ " 2cj2 + 2(c72)2 T ~ 2(a2)3/2 T 

dQN{9) _ 1 ((j*2)1/2a*i/2 r^i^^nT 

dX ""2 ^T^cL 2(cj2)i/2ai/2 t 

The value $\theta^* = (\rho^*, \sigma^{*2}, \lambda_N^*)$ maximizes $Q_N^*(\theta)$, setting the FOC to zero.

For parts (b)(i), (ii), $Q_N(\theta) = Q(\theta) + o_p(1)$ (uniformly in $\theta \in \Theta$ compact), given by

$Q(\theta) = -\frac12\ln\sigma^2 - \frac{1}{2\sigma^2}\lim_{T\to\infty}\frac{\mathrm{tr}(DW^*D')}{T} - \frac12\lambda + \frac12\lim_{T\to\infty}\frac{Z^*}{T}$,

where $W^*$ and $Z^*$ are defined in (A.5). Since $\theta \in \Theta$ compact and $Q(\theta)$ is continuous, $\hat\theta_N \to_p \theta^*$.

Part (b)(iii) follows analogously to Theorem 4.2(a)(iii). □

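Proof of Theorem 5.3. First, we prove part (a). The objective function is

(A.6)  $Q_N(\theta) = -\frac12\ln\sigma^2 - \frac{1}{2\sigma^2}\frac{\mathrm{tr}(DW_ND')}{T} - \frac{\lambda}{2} + \frac{(1 + Z_N^2)^{1/2}}{2T} - \frac{\ln(1 + (1 + Z_N^2)^{1/2})}{2T}$

up to an $O_p(N^{-1})$ term. All results below hold up to $o_p(N^{-1/2})$ order.

The components of the score function $S_N(\theta)$ are

$\frac{\partial Q_N(\theta)}{\partial\rho} = \frac{1}{\sigma^2}\frac{\mathrm{tr}(J_TW_ND')}{T} - \frac{1}{\sigma^2}\,\frac{2\lambda}{1 + (1 + Z_N^2)^{1/2}}\,\frac{1_T'J_TW_ND'1_T}{T}$,

$\frac{\partial Q_N(\theta)}{\partial\sigma^2} = -\frac{1}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\frac{\mathrm{tr}(DW_ND')}{T} - \frac{1}{(\sigma^2)^2}\,\frac{\lambda}{1 + (1 + Z_N^2)^{1/2}}\,\frac{1_T'DW_ND'1_T}{T}$,

$\frac{\partial Q_N(\theta)}{\partial\lambda} = -\frac12 + \frac{1}{\sigma^2}\,\frac{1}{1 + (1 + Z_N^2)^{1/2}}\,\frac{1_T'DW_ND'1_T}{T}$.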

The Hessian matrix $H_N(\theta^*) \to_p -\mathcal{I}_T(\theta^*)$, whose components are

9^Qjv(g) _ cT*^ 2A l^yF^FlT + A(l^Flr)^ 

9p2 - ^ 1 + (1 + ZiV2)i/2 

o-*2 tr(F'F) + A*1^F'F1t 
a*2\2 8A2 1 (C3)2 



(72; (l + (l + Z=V2)l/2)2(l + Z^2)l/2 r ' 



dlQ{9) _ a*"^ C4 g*^ 2A C3 

9p5a2 ~~(^r ^ (^2)2 i + (i + ZiV2)V2 T 

CJ*2 2AC2 1 



X 1 



a2 l + (l + ZjV2)V2 (1 + Z=V2)V2;' 
2 C3 

9paA (j2 1 + (1 + Z=^2)l/2 2- 
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a2 l + (l + ZjV2)l/2(l + Z=^2)l/2j' 

d%Q{e) _ (a*2)2 1 (C2)2 

a(cT2)2 (^2)4 (1 + (1 + Z]^2)l/2)2 (1 + ^^2)1/2 ^ 

1 £7*2 Cl CJ*2 2A C2 

^ " r ^ ((j2)3 1 + (1 + Z^2)l/2 

_ 1 C2 L £7*2 2AC2 l_ 



da^ dX (C72)2 l + (l+Z^2)l/2 2- ^ ^2 l + (l + ^^2)l/2 (1 + ^^2)1/2 f ' 



5A2 V(t2; (1 + (1 + Zi^2)l/2)2 (1 + ^*2)1/2 ^ ' 

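This convergence is uniform in $\theta = (\beta, \lambda)$ over a compact set containing $\theta^*$, as long as $\lambda > 0$. This completes part (a)(ii). To show part (a)(i), we write

\[
\sqrt{NT}\,S_N(\theta^*) = \sqrt{NT}\,S(W_N;\theta^*) = \sqrt{NT}\,[S(W_N;\theta^*) - S(W^*;\theta^*)],
\]

where the last equality holds because $S(W^*;\theta^*) = 0$. Using $\operatorname{vec}(W_N) = \mathcal{D}_T\operatorname{vech}(W_N)$, where $\mathcal{D}_T$ is the duplication matrix (e.g., [30]), we write

\[
\sqrt{NT}\,S_N(\theta^*) = \sqrt{NT}\,[L(\operatorname{vech}(W_N);\theta^*) - L(\operatorname{vech}(W^*);\theta^*)],
\]

where $L : \mathbb{R}^{T(T+1)/2} \to \mathbb{R}^3$. Now, $\sqrt{NT}(\operatorname{vech}(W_N) - \operatorname{vech}(W^*))$ converges to a normal distribution by a standard CLT. As a result, using the delta method and the information identity, $\sqrt{NT}\,S_N(\theta^*)$ converges to a normal distribution with zero mean and variance $\mathcal{I}_T(\theta^*)$. Part (iii) follows from [33].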
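
The identity $\operatorname{vec}(W) = \mathcal{D}\operatorname{vech}(W)$ for a symmetric matrix $W$, used above to pass from $W_N$ to its non-redundant elements, can be illustrated numerically. The following is a minimal sketch; the function names `duplication_matrix`, `vec` and `vech` are hypothetical helpers introduced here for illustration and do not come from the paper.

```python
import numpy as np

def duplication_matrix(n):
    """Return the n^2 x n(n+1)/2 matrix Dn with vec(A) = Dn @ vech(A)
    for every symmetric n x n matrix A (column-major vec)."""
    Dn = np.zeros((n * n, n * (n + 1) // 2))
    col = 0
    for j in range(n):            # column of the lower triangle
        for i in range(j, n):     # rows on and below the diagonal
            Dn[j * n + i, col] = 1.0   # position of A[i, j] in vec(A)
            Dn[i * n + j, col] = 1.0   # position of A[j, i] in vec(A)
            col += 1
    return Dn

def vec(A):
    """Stack the columns of A into one vector."""
    return A.reshape(-1, order="F")

def vech(A):
    """Stack the columns of the lower triangle of A (diagonal included)."""
    n = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(n)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    W = A + A.T                                  # an arbitrary symmetric matrix
    print(np.allclose(duplication_matrix(4) @ vech(W), vec(W)))
```

Working with $\operatorname{vech}(W_N)$ rather than $\operatorname{vec}(W_N)$ removes the duplicated off-diagonal entries, so the limiting normal distribution used in the delta method has a nonsingular covariance matrix in general.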

Part (b) follows from the asymptotic normality of the score (whose vari- 
ance is given by the reciprocal of the inverse of the limit of the Hessian 
matrix). As the remainder terms from expansions based on (A. 6) are asymp- 
totically negligible, (5.4) holds true. □ 

Proof of Corollary 5.1. As a preliminary result, we need to find the limits of $T^{-1}\operatorname{tr}(FF')$, $T^{-1}1_T'F1_T$ and $T^{-1}1_T'F'F1_T$, as $T \to \infty$. For the first term,

\[
\frac{1}{T}\operatorname{tr}(FF') = \frac{1}{T}\sum_{j=0}^{T-2}\sum_{i=0}^{j}\rho^{*2i} = \frac{T-1}{T}\sum_{i=0}^{T-2}\rho^{*2i} - \frac{1}{T}\sum_{i=0}^{T-2} i\rho^{*2i} \to \frac{1}{1-\rho^{*2}},
\]

because $\sum_{i=0}^{\infty} i(\rho^{*2})^i$ is a convergent series. This is true because a sufficient condition for a series $\sum_{i=0}^{\infty} c_i$ to converge is that $\lim |c_T|^{1/T} < 1$ as $T \to \infty$. Taking $c_i = i(\rho^{*2})^i$, $\lim |c_T|^{1/T} = \lim |T(\rho^{*2})^T|^{1/T} = \rho^{*2}\lim T^{1/T} = \rho^{*2} < 1$.
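Analogously,

\[
\frac{1}{T}1_T'F1_T = \frac{1}{T}\sum_{j=0}^{T-2}\sum_{i=0}^{j}\rho^{*i} = \frac{T-1}{T}\sum_{i=0}^{T-2}\rho^{*i} - \frac{1}{T}\sum_{i=0}^{T-2} i\rho^{*i} \to \frac{1}{1-\rho^*},
\]
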
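because $\sum_{i=0}^{\infty} i\rho^{*i}$ also converges. Finally, by the Cauchy-Schwarz inequality,

\[
\left(\frac{1}{T}\sum_{j=0}^{T-2}\sum_{i=0}^{j}\rho^{*i}\right)^2 \le \frac{1}{T}\sum_{j=0}^{T-2}\left(\sum_{i=0}^{j}\rho^{*i}\right)^2 = \frac{1}{T}1_T'F'F1_T \le \frac{T-1}{T}\,\frac{1}{(1-\rho^*)^2}.
\]

Taking limits, we obtain

\[
\frac{1}{(1-\rho^*)^2} \le \liminf \frac{1}{T}1_T'F'F1_T \le \limsup \frac{1}{T}1_T'F'F1_T \le \frac{1}{(1-\rho^*)^2}.
\]

Hence, the limit of $T^{-1}1_T'F'F1_T$ exists and equals $(1-\rho^*)^{-2}$.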
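
The three limits above can be checked numerically for a moderately large $T$. The sketch below assumes that $F$ is the strictly lower-triangular $T \times T$ matrix with entries $\rho^{*(i-j-1)}$ for $i > j$, which is the structure implied by the double sums in this proof; the function names are hypothetical and serve only as an illustration.

```python
import numpy as np

def build_F(T, rho):
    """Assumed form of F: strictly lower triangular, F[i, j] = rho**(i-j-1) for i > j."""
    idx = np.arange(T)
    lag = idx[:, None] - idx[None, :]              # i - j
    powers = np.power(float(rho), np.clip(lag - 1, 0, None))
    return np.where(lag > 0, powers, 0.0)

def averaged_quantities(T, rho):
    """Return (tr(FF')/T, 1'F1/T, 1'F'F1/T)."""
    F = build_F(T, rho)
    ones = np.ones(T)
    F_ones = F @ ones                              # the vector F 1_T
    tr_ff = np.sum(F * F) / T                      # tr(FF') is the sum of squared entries
    one_f_one = F_ones.sum() / T
    one_ff_one = F_ones @ F_ones / T
    return tr_ff, one_f_one, one_ff_one

if __name__ == "__main__":
    rho = 0.5
    finite_T = averaged_quantities(2000, rho)
    limits = (1 / (1 - rho**2), 1 / (1 - rho), 1 / (1 - rho) ** 2)
    for value, limit in zip(finite_T, limits):
        print(f"T = 2000: {value:.4f}   limit: {limit:.4f}")
```

The finite-$T$ values approach the limits at rate $1/T$, in line with the $(T-1)/T$ factors and the vanishing correction terms in the displays above.
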
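Therefore, the limiting information matrix $\mathcal{I}_\infty(\theta^*)$ simplifies to

\[
\mathcal{I}_\infty(\theta^*) =
\begin{pmatrix}
\dfrac{1}{1-\rho^{*2}} + \dfrac{\lambda^*}{(1-\rho^*)^2} & \dfrac{\lambda^*}{2\sigma^{*2}(1-\rho^*)} & \dfrac{1}{2(1-\rho^*)} \\[2ex]
\dfrac{\lambda^*}{2\sigma^{*2}(1-\rho^*)} & \dfrac{2+\lambda^*}{4(\sigma^{*2})^2} & \dfrac{1}{4\sigma^{*2}} \\[2ex]
\dfrac{1}{2(1-\rho^*)} & \dfrac{1}{4\sigma^{*2}} & \dfrac{1}{4\lambda^*}
\end{pmatrix}.
\]

The entry (1, 1) of the inverse of $\mathcal{I}_\infty(\theta^*)$ is

\[
\mathcal{I}_\infty^{11} = 1 - \rho^{*2}.
\]

□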
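
As a sanity check on the last step, the (1, 1) entry of the inverse can be computed numerically. The sketch below takes the entries of $\mathcal{I}_\infty(\theta^*)$ as read from the display in this proof; the function names and the parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def limiting_information(rho, sigma2, lam):
    """Limiting information matrix for (rho, sigma^2, lambda), entries as read from the proof."""
    a = 1.0 / (1.0 - rho**2)
    b = 1.0 / (1.0 - rho)
    return np.array([
        [a + lam * b**2,          lam * b / (2 * sigma2),        b / 2],
        [lam * b / (2 * sigma2),  (2 + lam) / (4 * sigma2**2),   1 / (4 * sigma2)],
        [b / 2,                   1 / (4 * sigma2),              1 / (4 * lam)],
    ])

def rho_asymptotic_variance(rho, sigma2, lam):
    """(1, 1) entry of the inverse of the limiting information matrix."""
    return np.linalg.inv(limiting_information(rho, sigma2, lam))[0, 0]

if __name__ == "__main__":
    for rho, sigma2, lam in [(0.5, 2.0, 1.5), (-0.3, 0.7, 4.0)]:
        print(rho_asymptotic_variance(rho, sigma2, lam), 1 - rho**2)
```

The two printed numbers agree for each parameter choice: the lower-right $2 \times 2$ block absorbs exactly the $\lambda^*/(1-\rho^*)^2$ term of the (1, 1) entry, so the result depends neither on $\sigma^{*2}$ nor on $\lambda^*$.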



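Proof of Theorem 5.4. When $T \to \infty$, the objective function is

\[
Q_N(\theta) = -\frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\,\frac{\operatorname{tr}(DW_ND')}{T} - \frac{\lambda}{2} + \frac{Z_N}{2T}
\]

up to an $o_p(N^{-1})$ term. All results below hold up to $o_p(N^{-1/2})$ order. The components of the score function $S_N(\theta)$ are

\[
\frac{\partial Q_N(\theta)}{\partial \rho} = \frac{1}{\sigma^2}\,\frac{\operatorname{tr}(J_TW_ND')}{T} - \frac{\lambda^{1/2}}{(\sigma^2)^{1/2}}\,\frac{1_T'J_TW_ND'1_T}{T\,(1_T'DW_ND'1_T)^{1/2}},
\]

\[
\frac{\partial Q_N(\theta)}{\partial \sigma^2} = -\frac{1}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\,\frac{\operatorname{tr}(DW_ND')}{T} - \frac{\lambda^{1/2}}{2(\sigma^2)^{3/2}}\,\frac{(1_T'DW_ND'1_T)^{1/2}}{T},
\]

\[
\frac{\partial Q_N(\theta)}{\partial \lambda} = -\frac{1}{2} + \frac{1}{2(\sigma^2)^{1/2}\lambda^{1/2}}\,\frac{(1_T'DW_ND'1_T)^{1/2}}{T}.
\]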
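
The score components can be compared against a finite-difference gradient of the objective. The sketch below rests on assumptions that are not stated in this section: $D = I_T - \rho J_T$ with $J_T$ the lag matrix (ones on the first subdiagonal), $Z_N = 2\lambda^{1/2}(1_T'DW_ND'1_T)^{1/2}/(\sigma^2)^{1/2}$, and a symmetric positive definite $W_N$. All function names are hypothetical.

```python
import numpy as np

def lag_matrix(T):
    """Assumed J_T: ones on the first subdiagonal, zeros elsewhere."""
    return np.eye(T, k=-1)

def objective(theta, W):
    """Large-T objective, with Z_N = 2 sqrt(lam * 1'DWD'1 / sigma2) (an assumption)."""
    rho, sigma2, lam = theta
    T = W.shape[0]
    D = np.eye(T) - rho * lag_matrix(T)
    ones = np.ones(T)
    q = ones @ D @ W @ D.T @ ones
    Z = 2.0 * np.sqrt(lam * q / sigma2)
    return (-0.5 * np.log(sigma2) - np.trace(D @ W @ D.T) / (2 * sigma2 * T)
            - lam / 2 + Z / (2 * T))

def score(theta, W):
    """Analytic score with respect to (rho, sigma2, lam)."""
    rho, sigma2, lam = theta
    T = W.shape[0]
    J = lag_matrix(T)
    D = np.eye(T) - rho * J
    ones = np.ones(T)
    q = ones @ D @ W @ D.T @ ones
    d_rho = (np.trace(J @ W @ D.T) / (sigma2 * T)
             - np.sqrt(lam / sigma2) * (ones @ J @ W @ D.T @ ones) / (T * np.sqrt(q)))
    d_sigma2 = (-1 / (2 * sigma2) + np.trace(D @ W @ D.T) / (2 * sigma2**2 * T)
                - np.sqrt(lam * q) / (2 * sigma2**1.5 * T))
    d_lam = -0.5 + np.sqrt(q) / (2 * np.sqrt(sigma2 * lam) * T)
    return np.array([d_rho, d_sigma2, d_lam])

def numerical_gradient(f, theta, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = h
        grad[k] = (f(theta + step) - f(theta - step)) / (2 * h)
    return grad

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((6, 20))
    W = A @ A.T / 20 + np.eye(6)          # arbitrary symmetric positive definite W
    theta = np.array([0.4, 1.3, 0.7])
    print(score(theta, W))
    print(numerical_gradient(lambda th: objective(th, W), theta))
```

Under these assumptions the two printed vectors agree to numerical precision, which supports the minus sign on the second term of the $\rho$ component: increasing $\rho$ changes $1_T'DW_ND'1_T$ at rate $-2\,1_T'J_TW_ND'1_T$.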
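If $|\rho^*|$ is bounded away from one, as $T \to \infty$,

\[
\frac{\operatorname{tr}(J_TW_ND')}{T} \to_p \lim \frac{\operatorname{tr}(J_TW_N^*D')}{T}, \qquad
\frac{1_T'J_TW_ND'1_T}{T^2} \to_p \lim \frac{1_T'J_TW_N^*D'1_T}{T^2},
\]

and

\[
\frac{\operatorname{tr}(DW_ND')}{T} \to_p \lim \frac{\operatorname{tr}(DW_N^*D')}{T}, \qquad
\frac{1_T'DW_ND'1_T}{T^2} \to_p \lim \frac{1_T'DW_N^*D'1_T}{T^2}.
\]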



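As a result, the Hessian matrix $-H_N(\theta) \to_p \mathcal{I}_\infty(\theta)$, whose components are limits of

\[
\frac{\partial^2 Q_N(\theta)}{\partial \rho^2} = -\frac{\sigma^{*2}}{\sigma^2}\,\frac{\operatorname{tr}(F'F) + \lambda^*1_T'F'F1_T}{T},
\]

\[
\frac{\partial^2 Q_N(\theta)}{\partial \rho\,\partial \sigma^2} = -\frac{\sigma^{*2}}{(\sigma^2)^2}\,\frac{c_4}{T} + \frac{\lambda^{1/2}\lambda^{*1/2}(\sigma^{*2})^{1/2}}{2(\sigma^2)^{3/2}}\,\frac{1_T'F1_T}{T},
\]

\[
\frac{\partial^2 Q_N(\theta)}{\partial \rho\,\partial \lambda} = -\frac{(\sigma^{*2})^{1/2}\lambda^{*1/2}}{2(\sigma^2)^{1/2}\lambda^{1/2}}\,\frac{1_T'F1_T}{T},
\]

\[
\frac{\partial^2 Q_N(\theta)}{\partial (\sigma^2)^2} = \frac{1}{2(\sigma^2)^2} - \frac{\sigma^{*2}}{(\sigma^2)^3}\,\frac{c_1}{T} + \frac{3(\sigma^{*2})^{1/2}\lambda^{1/2}\lambda^{*1/2}}{4(\sigma^2)^{5/2}}\,\frac{1_T'DB^*1_T}{T},
\]

\[
\frac{\partial^2 Q_N(\theta)}{\partial \sigma^2\,\partial \lambda} = -\frac{(\sigma^{*2})^{1/2}\lambda^{*1/2}}{4(\sigma^2)^{3/2}\lambda^{1/2}}\,\frac{1_T'DB^*1_T}{T}
\]

and

\[
\frac{\partial^2 Q_N(\theta)}{\partial \lambda^2} = -\frac{(\sigma^{*2})^{1/2}\lambda^{*1/2}}{4(\sigma^2)^{1/2}\lambda^{3/2}}\,\frac{1_T'DB^*1_T}{T}.
\]

This convergence is uniform in $\theta = (\beta, \lambda)$ over a compact set containing $\theta^*$, as long as $|\rho^*|$ is bounded away from one. This completes part (ii). To show part (i), define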

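\[
\mathcal{W}_N = \left(\frac{\operatorname{tr}(J_TW_ND^{*\prime})}{T},\ \frac{1_T'J_TW_ND^{*\prime}1_T}{T^2},\ \frac{\operatorname{tr}(D^*W_ND^{*\prime})}{T},\ \frac{1_T'D^*W_ND^{*\prime}1_T}{T^2}\right)'
\]

and

\[
\mathcal{W}_N^* = \left(\frac{\operatorname{tr}(J_TW_N^*D^{*\prime})}{T},\ \frac{1_T'J_TW_N^*D^{*\prime}1_T}{T^2},\ \frac{\operatorname{tr}(D^*W_N^*D^{*\prime})}{T},\ \frac{1_T'D^*W_N^*D^{*\prime}1_T}{T^2}\right)',
\]

and write

\[
\sqrt{NT}\,S_N(\theta^*) = \sqrt{NT}\,[L(\mathcal{W}_N;\theta^*) - L(\mathcal{W}_N^*;\theta^*)],
\]

where $L : \mathbb{R}^4 \to \mathbb{R}^3$. Now, $\sqrt{NT}(\mathcal{W}_N - \mathcal{W}_N^*)$ converges to a normal distribution by a standard CLT and the Cramér-Wold device. Using the delta method and the information identity, $\sqrt{NT}\,S_N(\theta^*)$ converges to a normal distribution with zero mean and variance $\mathcal{I}_\infty(\theta^*)$, as long as $N > T$. Part (iii) follows from [33]. □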